G

All Knowledge

Articles, tutorials & news on AI Quality, Security & Compliance

Recent content

Phare LLM Benchmark - an analysis of hallucination in leading LLMs
News

Good answers are not necessarily factual answers: an analysis of hallucination in leading LLMs

We're sharing the first results from Phare, our multilingual benchmark for evaluating language models. The benchmark research reveals leading LLMs confidently produce factually inaccurate information. Our evaluation of top models from eight AI labs shows they generate authoritative-sounding responses containing completely fabricated details, particularly when handling misinformation.

Matteo Dora - Machine Learning Researcher
Matteo Dora
View post
Testing LLM Agents through continuous Red Teaming
Tutorials

How to implement LLM as a Judge to test AI Agents? (Part 2)

Testing AI agents effectively requires automated systems that can evaluate responses across several scenarios. In this second part of our tutorial, we'll explore how to automate test execution and implement continuous red teaming for LLM agents. Learn to systematically evaluate agentic AI systems, interpret results, and maintain security through ongoing testing as your AI application evolves.

Jean-Marie John-Mathews
Jean-Marie John-Mathews, Ph.D.
View post
Implementing LLM as a Judge to test AI agents
Tutorials

How to implement LLM as a Judge to test AI Agents? (Part 1)

Testing AI agents effectively requires automated systems that can evaluate responses across several scenarios. In this first part of our tutorial, we introduce a systematic approach using LLM as a judge to detect hallucinations and security vulnerabilities before deployment. Learn how to generate synthetic test data and implement business annotation processes for exhaustive AI agent testing.

Jean-Marie John-Mathews
Jean-Marie John-Mathews, Ph.D.
View post