While the cybersecurity risks of generative AI are well documented through frameworks like the OWASP Top 10 and the NIST AI Risk Management Framework, there is no comparable systematic taxonomy for non-security performance failures. Yet these failures are the submerged part of the iceberg undermining enterprise AI adoption.
Security incidents make headlines, but AI projects typically fail due to subtle performance issues: systems that hallucinate product features, refuse legitimate customer requests, or provide incomplete service information. These failures don't trigger security alerts, but they erode trust and cause AI initiatives to be abandoned.
Giskard launches RealPerformance to address this gap: the first systematic dataset of business performance failures in conversational AI, based on real-world testing across banks, insurers, and manufacturers.
Testing AI Agents in Real-World Scenarios
Existing AI safety frameworks comprehensively cover security threats but miss the larger problem: performance failures that don't compromise security but derail AI projects. While security breaches are dramatic, the primary obstacle to AI adoption is subtle performance issues that violate business rules and customer expectations.
At Giskard, we regularly test generative AI applications for enterprise clients across banking, insurance, and manufacturing. Through systematic analysis of hundreds of production failures, we identified a consistent pattern: most business-impacting failures aren't security-related, but performance issues affecting compliance, customer experience, and operational reliability.
We developed a business-focused taxonomy by categorising real failure patterns observed across industries. For each category, we generated realistic examples using authentic failure cases as inspiration, creating a framework that addresses everyday operational issues determining AI project success.
Performance failures quietly erode AI confidence. When systems consistently provide incomplete information or refuse routine requests, users lose trust in AI technology broadly. RealPerformance addresses this critical barrier to enterprise adoption by providing systematic evaluation of the performance failures that determine whether AI systems can reliably serve business needs.
What is RealPerformance?
RealPerformance is a dataset and platform that provides pairs of chosen (compliant) and rejected (non-compliant) responses, helping teams understand problematic behaviours related to business compliance and performance. The dataset covers a wide range of these AI business compliance issues across multiple domains, making it a valuable resource for researchers, developers, and organisations working on conversational AI agents.
So, what makes RealPerformance special?
- Comprehensive AI issue coverage: Critical performance problems such as information addition and wrong moderation
- Multi-domain coverage: Healthcare, finance, retail, technology, and other sectors
- Training-ready format: Labeled chosen/rejected response pairs for model training (illustrated by the sketch after this list)
- Enhanced interpretability: Detailed reasoning for why responses are problematic
- Rich application context: Includes RAG and application-specific descriptions
- Real-world grounding: Based on actual failure patterns from live AI deployments
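To make the format concrete, a single record might look like the sketch below. The field names and values are hypothetical and only illustrate the structure (application context, conversation, chosen/rejected responses, reasoning, and metadata); refer to the dataset card for the actual schema.

```python
# Hypothetical shape of a single RealPerformance sample; actual field names may differ.
sample = {
    "domain": "insurance",
    "issue_type": "request_refusal",
    "application_description": "Claims assistant grounded in the insurer's policy documents (RAG)",
    "conversation": [
        {"role": "user", "content": "Can I add my new car to my existing policy online?"},
    ],
    # Rejected (non-compliant): refuses a routine request the assistant should handle.
    "rejected": "I'm sorry, I can't help with policy changes.",
    # Chosen (compliant): answers within the documented business rules.
    "chosen": "Yes, you can add a vehicle from the 'My policy' page; the change takes effect immediately.",
    "reasoning": "The rejected response refuses a legitimate request covered by the assistant's scope.",
    "severity": "medium",
}
```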
Core AI Functional Vulnerabilities
RealPerformance systematically addresses six critical performance issues that commonly occur in conversational AI agents.
Systematic Generation of Real-World Test Cases
Like RealHarm, RealPerformance draws inspiration from actual AI failures; unlike RealHarm, however, it relies on generating cases from a taxonomy. This taxonomy is built from source issues: existing problematic interactions with textual AI agents. We then used LLMs to reproduce these failure patterns.
Each generated issue is modeled using structured templates that define realistic business contexts, failure triggers, common problematic responses, and effective correction strategies. This lets us create realistic examples while keeping control over quality.
The generated cases are also adapted to key domains such as healthcare, finance, retail, and technology, ensuring industry relevance and dataset diversity. To ensure real-world relevance, the methodology integrates domain-specific business rules and constraints into each test case, models realistic user intents and information needs, and ensures that both the incorrect and corrected responses accurately reflect the behaviour patterns observed in actual AI systems.
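To make the template idea concrete, here is a minimal sketch of what such a generation template could look like in Python. The class name, fields, and example values are hypothetical illustrations, not the actual schema used internally.

```python
from dataclasses import dataclass, field

@dataclass
class FailureTemplate:
    """Hypothetical template describing how one RealPerformance test case could be generated."""
    issue_type: str          # e.g. "information_addition" or "wrong_moderation"
    domain: str              # e.g. "finance", "healthcare", "retail", "technology"
    business_context: str    # the agent, its knowledge base, and the rules it must follow
    failure_trigger: str     # user request that tends to elicit the failure
    rejected_response: str   # typical problematic answer observed in the wild
    chosen_response: str     # corrected answer that respects the business rules
    reasoning: str           # why the rejected response violates the rules
    constraints: list[str] = field(default_factory=list)  # domain-specific business rules

example = FailureTemplate(
    issue_type="information_addition",
    domain="finance",
    business_context="Retail-banking assistant limited to the product catalogue in its RAG store",
    failure_trigger="Does my basic account include free international transfers?",
    rejected_response="Yes, all accounts include unlimited free international transfers.",
    chosen_response="The basic account does not include free international transfers; "
                    "they are available with the premium account.",
    reasoning="The rejected answer invents a product feature that does not exist in the catalogue.",
    constraints=["Only describe features present in the product catalogue"],
)
```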
Dataset Structure and Usage
RealPerformance provides a structured format that makes it easy to integrate into existing AI training and evaluation pipelines.
For AI practitioners in organisations, the dataset supports risk assessment within specific domains, helps verify compliance with safety and regulatory standards, and provides a resource for educating teams on common AI safety challenges.
For AI researchers, this framework supports model evaluation by testing how well models differentiate safe from unsafe responses, enables fine-tuning using preference pairs for reinforcement learning, and provides a basis for benchmarking AI systems on safety and reliability.
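As an illustration of the evaluation use case, the sketch below checks how often a causal language model assigns a higher likelihood to the chosen response than to the rejected one for the same prompt. The model name and the two inline sample pairs are placeholders; in practice you would iterate over the RealPerformance records instead.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; swap in the model under evaluation
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def response_logprob(prompt: str, response: str) -> float:
    """Sum of token log-probabilities the model assigns to `response` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)             # predictions for tokens 1..T-1
    token_lp = logprobs.gather(2, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[0, prompt_len - 1:].sum().item()                 # keep only the response tokens

# Two toy pairs standing in for RealPerformance samples.
pairs = [
    {
        "prompt": "Does the basic account include free international transfers?",
        "chosen": " No, international transfers are only free with the premium account.",
        "rejected": " Yes, every account includes unlimited free international transfers.",
    },
    {
        "prompt": "Can I add my new car to my existing policy online?",
        "chosen": " Yes, you can add a vehicle from the 'My policy' page.",
        "rejected": " Sorry, I can't help with policy changes.",
    },
]

correct = sum(
    response_logprob(p["prompt"], p["chosen"]) > response_logprob(p["prompt"], p["rejected"])
    for p in pairs
)
print(f"Preference accuracy: {correct}/{len(pairs)}")
```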
Dataset Availability and Technical Details
RealPerformance is available as an open-source dataset on Hugging Face, providing comprehensive coverage of AI performance issues with real-world context. The dataset includes:
- 1,000+ conversation samples across multiple domains and performance issue types
- Preference learning format with chosen vs. rejected response pairs
- Detailed annotations including issue descriptions, reasoning, and severity levels
- Multi-domain coverage including healthcare, finance, retail, and technology
- Structured metadata for easy filtering and analysis
The dataset is designed to be easily integrated into existing AI evaluation pipelines and can be used for both training and testing purposes. By design, the benchmark incorporates diverse business contexts to ensure comprehensiveness, and representative samples are open-source.
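As a sketch of pipeline integration, the snippet below loads the dataset from the Hugging Face Hub and uses the structured metadata for filtering. The dataset identifier and column names are assumptions; check the published dataset card for the exact values.

```python
from collections import Counter
from datasets import load_dataset

# Placeholder dataset id; the exact name is given on the dataset card on the Hugging Face Hub.
ds = load_dataset("giskard-ai/realperformance", split="train")

# Filter on the structured metadata (column names are assumptions).
finance_high = ds.filter(lambda ex: ex["domain"] == "finance" and ex["severity"] == "high")
print(f"{len(finance_high)} high-severity finance samples")

# Quick overview of which performance issue types dominate in that slice.
print(Counter(finance_high["issue_type"]).most_common())
```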
Next steps
The dataset is designed to evolve with the AI landscape, incorporating new failure patterns as they emerge and expanding to cover new domains and languages. By providing a foundation for systematic AI safety testing, RealPerformance aims to contribute to the development of more reliable, trustworthy conversational AI systems.
Giskard continues to invest in AI safety research and development, with RealPerformance being part of a broader initiative to improve the trustworthiness and reliability of AI systems in production environments.
Get Involved
- GitHub Repository: https://github.com/giskard-ai/realperformance
- Documentation: https://realperformance.giskard.ai
- Community: Join discussions on AI safety testing and contribute to the project
For organisations interested in contributing to the RealPerformance initiative or any of our other initiatives, please reach out to the research team at info@giskard.ai.