June 3, 2026
6 min read
Pierre Le Jeune
Blanca Rivera Campos

Every frontier LLM generates harmful stereotypes in open-ended generation

When given the freedom to write stories, do frontier LLMs fall back on harmful stereotypes? Giskard's R&D team prompted 23 leading models to generate over 650,000 open-ended stories across 10 languages, then analyzed the demographic associations they produced. Every single model generated harmful stereotypes, many of which the models themselves recognized as harmful.
StereoTales: Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

When 23 frontier LLMs were given creative freedom to write stories, all of them surfaced harmful stereotypes, including content each model explicitly flagged as harmful in its own evaluation. This is the main finding of StereoTales, a new multilingual LLM bias detection framework released by Giskard's R&D team.

For a more technical deep-dive, refer to the research blog article here.

Current LLM bias benchmarks fall short

Standard LLM bias benchmarks (BBQ, StereoSet, CrowS-Pairs) test recognition: can the model identify a stereotype when directly asked? That's a different cognitive task than open-ended generation.

In StereoTales, models were prompted to write short stories (~200 words) featuring a protagonist defined by a single demographic attribute: "a non-binary person," "a person with low income," "a person from North Africa." Everything else about the protagonist emerged from the model's own associations.

The pipeline covered 79 attribute values across 19 demographic dimensions (gender, religion, income, immigration status, and more), generating over 650,000 stories from 23 leading models across providers including Anthropic, OpenAI, Google, Mistral, Alibaba, and xAI. Stories were generated in 10 languages: English, French, Spanish, Italian, Portuguese, Dutch, Ukrainian, Arabic, Hindi, and Chinese.

Stereotales - story explorer

Full story explorer here.

Main findings

Across 650,000 stories, 23 models, and 10 languages, StereoTales surfaced three main findings:

  • Every frontier LLM produces harmful stereotypes in free-form generation, regardless of model size, provider, or capability tier.
  • Models know which associations are harmful, but they generate them anyway. LLMs asked to rate the same associations they produced showed meaningful agreement with human raters, yet still generated content they themselves would flag as harmful. Recognition and generation are misaligned.
  • English-only bias testing is incomplete. Harmful stereotypes are largely language-specific: they adapt to the cultural context of the prompt language, amplifying biases against locally marginalized groups.

Finding 1: No LLM is free of stereotypes

The headline result leaves no room for nuance: all 23 models produced harmful stereotypes in open-ended generation, including the largest, highest-capability models. Even the least biased models in the set produced 20–30 harmful associations.

Some of the most widespread harmful associations (shared by all 23 models) include:

  • Low education → trades and manual labor
  • Non-binary gender → arts and creative industries
  • Low income → basic education
  • High income → Jewish religion
Stereotales - Association explorer, harmful associations

Full association explorer here.

Finding 2: Models know what's harmful, but still generates stereotypes

StereoTales asked the same 23 LLMs to rate each association for harmfulness, the same task given to human raters (see methodology in below section). The results expose a misalignment.

Human and model ratings correlate reasonably well overall. But the disagreement comes from:

  • LLMs underestimate harm on socioeconomic attributes: age, income, employment, education, political orientation, religion, urbanicity, immigration status.
  • LLMs overestimate harm on gender, gender alignment, and geographic origin, the axes that have received the most regulatory and public attention.

The alignment recipes have trained models to be hypersensitive to historically high-profile bias axes while leaving them relatively blind to the breadth of socioeconomic stereotyping.

More critically, every model generates associations it itself classifies as harmful. The generative and discriminative blind spots aligns: the attributes models most underestimate as harmful are also the ones where they produce the most stereotyped associations.

Finding 3: Harmful stereotypes in LLMs are language-specific

Harmful associations are not simply translated from an English-dominant training of models. LLMs adapt culturally to the prompt language, amplifying biases against locally salient protected groups.

Key findings from the per-language analysis:

  • Harmful biases are language-specific. Harmful associations tend to appear in only 1–2 languages. An English-only fairness evaluation will miss most of the harm a model produces when prompted in other languages.
  • Languages share biases with their cultural neighbours. French, Italian, and Dutch models produced heavily overlapping stereotypical content. So did Spanish and Portuguese. Languages with shared geography and cultural history inherit shared blind spots.
  • Models shift their bias targets depending on the prompt language. When prompted in Arabic, models generated fewer harmful associations targeting Muslims  (the majority group in Arabic-speaking regions) and more targeting Christians. The pattern held consistently: switching into a language tends to reduce stereotypes about that culture's dominant group and amplify them against its marginalized ones.

We describe this behavior as LLMs acting as "cultural chameleons", absorbing the bias most salient in the training corpus associated with the prompt language.

Methodology

Stereotales methodology

Each generated story was automatically analyzed to extract the full demographic profile of its protagonist (age, gender, income, religion, employment status, and more). This extraction was performed by an ensemble of three LLMs.

From there, statistical tests identified which demographic associations appeared more often than chance across the full story set, for example, whether low-income protagonists were often cast as less educated. Only associations with a large, consistent over-representation were kept.

Finally, to determine which associations were actually harmful, the team recruited 247 independent human raters who scored each association on a 1–5 harmfulness scale (each association was annotated by ~7 raters on average). An association was classified as harmful only if it scored ≥ 4, a conservative threshold, yielding 118 harmful associations out of 1,580 statistically significant ones.

Next steps

This is an ongoing research effort, and we're looking to expand StereoTales to more languages. If you're a native speaker interested in contributing to the framework (translating prompts, validating attribute values, or advising on cultural context) we'd love to hear from you. Reach out at [email protected].

Resources

The full StereoTales dataset, pipeline, and preprint are publicly available:

StereoTales was authored by Pierre Le Jeune, Etienne Duchesne, Stefano Palminteri, Weixuan Xiao, Bazire Houssin, Benoît Malézieux, and Matteo Dora. Preprint released May 22, 2026.

Continuously secure LLM agents, preventing hallucinations and security issues.
Book a Demo

You will also like

Phare LLM benchmark V2: Reasoning models don't guarantee better security

Phare LLM benchmark V2: Reasoning models don't guarantee better security

Phare (Potential Harm Assessment & Risk Evaluation) is an independent, multilingual benchmark designed to evaluate AI models across four critical dimensions, or “modules”: hallucination, bias, harmfulness, and vulnerability to jailbreaking attacks. This second version expands our evaluation to include reasoning models from leading providers, allowing us to assess whether these advanced systems represent a meaningful improvement in AI safety.

View post
Who judges the LLM-as-a-Judge? Meta-Evaluation of an LLM vulnerability scanner

Who judges the LLM-as-a-Judge? Meta-Evaluation of an LLM vulnerability scanner

When your LLM vulnerability scanner detects a threat, it relies on an LLM judge to decide whether the attack succeeded. Using one LLM to evaluate another can bring some failures into your evaluation pipeline (false positives, model drift, or context blindness). This article walks through how we meta-evaluated our own LLM-as-a-judge using giskard-checks to freeze expected verdicts, replay attack traces, and detect evaluator regressions in CI.

View post
Claude Mythos: Analyzing Anthropic’s new frontier model for AI security

Claude Mythos: Analyzing Anthropic’s new frontier model for AI security

In this article, we analyze Anthropic's newly announced Claude Mythos model and its announced capabilities in automated vulnerability discovery and exploit generation. We explore how this frontier model impacts the cybersecurity landscape.

View post
Get AI security insights in your inbox