

When your LLM vulnerability scanner detects a threat, it relies on an LLM judge to decide whether the attack succeeded. Using one LLM to evaluate another can bring some failures into your evaluation pipeline (false positives, model drift, or context blindness). This article walks through how we meta-evaluated our own LLM-as-a-judge using giskard-checks to freeze expected verdicts, replay attack traces, and detect evaluator regressions in CI.


Since its creation, Giskard has been on a mission to help teams ship AI they can trust. Today, we are thrilled to announce Giskard v3, a complete evolution of the library designed for the modern landscape of Large Language Models (LLMs) and sophisticated AI agents.
