News
July 17, 2025
10 minutes

RealPerformance, A Dataset of Language Model Business Compliance Issues

Giskard launches RealPerformance: the first systematic dataset of business performance failures in conversational AI, based on real-world testing across banks, insurers, and manufacturers.

David Berenstein

While the cybersecurity risks of generative AI are well documented through frameworks like the OWASP Top 10 and the NIST AI Risk Management Framework, there is no systematic taxonomy for non-security performance failures. Yet these failures are the submerged bulk of the iceberg undermining enterprise AI adoption.

Security incidents make headlines, but AI projects typically fail due to subtle performance issues: systems that hallucinate product features, refuse legitimate customer requests, or provide incomplete service information. These failures don't trigger security alerts, but they erode trust and cause AI initiatives to be abandoned.

Giskard launches RealPerformance to address this gap: the first systematic dataset of business performance failures in conversational AI, based on real-world testing across banks, insurers, and manufacturers.

Testing AI Agents in Real-World Scenarios

Existing AI safety frameworks comprehensively cover security threats but miss the larger problem: performance failures that don't compromise security but derail AI projects. While security breaches are dramatic, the primary obstacle to AI adoption is subtle performance issues that violate business rules and customer expectations.

At Giskard, we regularly test generative AI applications for enterprise clients across banking, insurance, and manufacturing. Through systematic analysis of hundreds of production failures, we identified a consistent pattern: most business-impacting failures aren't security-related, but performance issues affecting compliance, customer experience, and operational reliability.

We developed a business-focused taxonomy by categorising real failure patterns observed across industries. For each category, we generated realistic examples using authentic failure cases as inspiration, creating a framework that addresses everyday operational issues determining AI project success.

Performance failures quietly erode AI confidence. When systems consistently provide incomplete information or refuse routine requests, users lose trust in AI technology broadly. RealPerformance addresses this critical barrier to enterprise adoption by providing systematic evaluation of the performance failures that determine whether AI systems can reliably serve business needs.

What is RealPerformance?

RealPerformance is a dataset and platform that provides pairs of chosen (compliant) and rejected (non-compliant) responses to help understand problematic behaviours related to business compliance and performance. The dataset covers a wide range of these AI business compliance issues across multiple domains, making it an invaluable resource for researchers, developers, and organisations working on conversational AI agents.

So, what makes RealPerformance special?

  • Comprehensive issue coverage: Critical performance problems such as addition of information and wrong moderation
  • Multi-domain coverage: Healthcare, finance, retail, technology, and other sectors
  • Training-ready format: Labeled chosen/rejected response pairs for model training
  • Enhanced interpretability: Detailed reasoning for why responses are problematic
  • Rich application context: Includes RAG and application-specific descriptions
  • Real-world grounding: Based on actual failure patterns from live AI deployments
An illustrative example of a case from the RealPerformance dataset follows.
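To make the format concrete, here is a hypothetical record sketched in Python. The field names (issue_type, domain, chosen, rejected, reasoning) are our assumptions based on the description above, not the dataset's confirmed schema.

```python
# Hypothetical RealPerformance-style record; every field name here is
# illustrative, not the dataset's confirmed schema.
record = {
    "issue_type": "addition_of_information",
    "domain": "retail",
    "context": "Product FAQ for a retail chatbot; no discounts are mentioned.",
    "user_message": "Do you offer any discounts on this item?",
    # Rejected (non-compliant): invents offers absent from the context.
    "rejected": "Yes! New customers get 20% off plus free gift wrapping.",
    # Chosen (compliant): sticks to what the context supports.
    "chosen": "I don't have any discount information for this item. "
              "Please check our promotions page for current offers.",
    "reasoning": "The rejected response adds discounts and services not present "
                 "in the context, risking false promises to the customer.",
}
```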

Core AI Functional Vulnerabilities

RealPerformance systematically addresses six critical performance issues that commonly occur in conversational AI agents:

  • Addition of Information: The AI adds information not present in its context or knowledge base. Example: an assistant invents discounts and complimentary services. Business impact: revenue loss from false promises and customer dissatisfaction.
  • Business Out of Scope: The AI answers questions outside the business scope of the bot. Example: a system discloses revenue from consumer sales. Business impact: compliance violations and competitive-intelligence leaks.
  • Denial of Answer: The AI incorrectly refuses to answer legitimate questions within its scope. Example: an assistant refuses to discuss debt management even though the company offers that service. Business impact: lost sales opportunities and customer abandonment.
  • Contradiction: AI responses contradict the reference context or established rules. Example: an answer that is unfaithful to retrieved RAG context containing IRS regulations. Business impact: regulatory exposure and decision-making confusion.
  • Omission: The AI fails to provide complete information available in its context. Example: an assistant omits critical details about the types of data the system collects. Business impact: legal exposure and customer-trust erosion from undisclosed data collection practices.
  • Wrong Moderation: The AI applies inappropriate moderation responses. Example: an assistant incorrectly refuses to help with a standard account security update. Business impact: service disruption and blocking of legitimate customers.
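For programmatic filtering and reporting, the six categories can be encoded as a small enum. This is an illustrative sketch; neither the enum nor the label strings are shipped with the dataset.

```python
from enum import Enum

class PerformanceIssue(str, Enum):
    """The six failure categories described above; label strings are illustrative."""
    ADDITION_OF_INFORMATION = "addition_of_information"
    BUSINESS_OUT_OF_SCOPE = "business_out_of_scope"
    DENIAL_OF_ANSWER = "denial_of_answer"
    CONTRADICTION = "contradiction"
    OMISSION = "omission"
    WRONG_MODERATION = "wrong_moderation"

def count_by_issue(records: list[dict]) -> dict:
    """Tally annotated records per failure category."""
    counts = {issue: 0 for issue in PerformanceIssue}
    for record in records:
        counts[PerformanceIssue(record["issue_type"])] += 1
    return counts
```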

Systematic Generation of Real-World Test Cases

Like RealHarm, RealPerformance draws inspiration from actual AI failures; unlike RealHarm, however, it relies on generating cases from a taxonomy. The taxonomy is grounded in source issues: existing problematic interactions with text-based AI agents. We then used LLMs to reproduce these failure patterns.

Each generated issue is modelled using structured templates that define realistic business contexts, failure triggers, common problematic responses, and effective correction strategies. This lets us create realistic examples while maintaining consistent quality. One way to picture these templates is sketched below.
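The sketch below is our illustration of a template as a record with those four elements; it is not Giskard's internal template format, and all field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FailureTemplate:
    """Illustrative template mirroring the four elements described above."""
    business_context: str      # e.g. a retail chatbot answering product FAQs
    failure_trigger: str       # the user intent that tends to provoke the failure
    problematic_response: str  # the pattern of the rejected (non-compliant) answer
    correction_strategy: str   # how the chosen (compliant) answer should behave

template = FailureTemplate(
    business_context="Retail chatbot answering product FAQs",
    failure_trigger="User asks about discounts the context never mentions",
    problematic_response="Assistant invents a discount or complimentary service",
    correction_strategy="Assistant states only what the retrieved context supports",
)
```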

The generated cases are also adapted to key domains such as healthcare, finance, retail, and technology, ensuring industry relevance and dataset diversity. To ensure real-world relevance, the methodology integrates domain-specific business rules and constraints into each test case. It models realistic user intents and information needs, and ensures that both the rejected and corrected responses accurately reflect behaviour patterns observed in actual AI systems.

Figure: A high-level overview of the RealPerformance methodology.

Dataset Structure and Usage

RealPerformance provides a structured format that makes it easy to integrate into existing AI training and evaluation pipelines.

For AI practitioners in organisations, it can support risk assessment within specific domains, help verify compliance with safety and regulatory standards, and serve as a resource for educating teams on common AI safety challenges.

For AI researchers, this framework supports model evaluation by testing how well models differentiate safe from unsafe responses, enables fine-tuning using preference pairs for reinforcement learning, and provides a basis for benchmarking AI systems on safety and reliability.
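As a concrete example of the model-evaluation use case, the sketch below scores whether a causal language model assigns a higher likelihood to the chosen response than to the rejected one, using Hugging Face transformers. The record fields follow the hypothetical schema from earlier; "gpt2" stands in for any causal LM.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any causal LM works here
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sequence_logprob(prompt: str, response: str) -> float:
    """Total log-probability the model assigns to `response` given `prompt`.

    Assumes the tokenisation of `prompt` is a prefix of the tokenisation of
    `prompt + response`, which holds for most tokenisers in practice.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position i predicts token i+1, so shift logits and targets by one.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the log-probs of the response tokens.
    return token_lp[:, prompt_len - 1 :].sum().item()

def prefers_chosen(record: dict) -> bool:
    """True if the model ranks the compliant answer above the non-compliant one."""
    prompt = record["user_message"]
    return sequence_logprob(prompt, record["chosen"]) > sequence_logprob(
        prompt, record["rejected"]
    )
```

Accuracy over the whole dataset (the fraction of pairs where the chosen response wins) gives a simple preference-ranking score; the same pairs can also feed preference-based fine-tuning methods such as DPO.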

Dataset Availability and Technical Details

RealPerformance is available as an open-source dataset on Hugging Face, providing comprehensive coverage of AI safety issues with real-world context. The dataset includes:

  • 1,000+ conversation samples across multiple domains and performance issue types
  • Preference learning format with chosen vs. rejected response pairs
  • Detailed annotations including issue descriptions, reasoning, and severity levels
  • Multi-domain coverage including healthcare, finance, retail, and technology
  • Structured metadata for easy filtering and analysis

The dataset is designed to integrate easily into existing AI evaluation pipelines and can be used for both training and testing. By design, the benchmark incorporates diverse business contexts to ensure comprehensiveness, and representative samples are released as open source.
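A minimal loading sketch with the datasets library is shown below. The Hugging Face repo ID, split name, and field names are assumptions; check the dataset card for the actual values.

```python
from datasets import load_dataset

# Repo ID is an assumption -- see Giskard's Hugging Face page for the actual ID.
ds = load_dataset("giskard/RealPerformance", split="train")

# Use the structured metadata to narrow down to one domain and issue type.
# Field names ("domain", "issue_type") are assumed, not confirmed.
finance_omissions = ds.filter(
    lambda r: r["domain"] == "finance" and r["issue_type"] == "omission"
)
print(f"{len(finance_omissions)} finance omission cases")
```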

Next steps

The dataset is designed to evolve with the AI landscape, incorporating new failure patterns as they emerge and expanding to cover new domains and languages. By providing a foundation for systematic AI safety testing, RealPerformance aims to contribute to the development of more reliable, trustworthy conversational AI systems.

Giskard continues to invest in AI safety research and development, with RealPerformance being part of a broader initiative to improve the trustworthiness and reliability of AI systems in production environments.

Get Involved

For organisations interested in contributing to the RealPerformance initiative or any of our other initiatives, please reach out to the research team at info@giskard.ai.
