Every frontier LLM generates harmful stereotypes in open-ended generation

When 23 frontier LLMs were given creative freedom to write stories, all of them surfaced harmful stereotypes, including content each model explicitly flagged as harmful in its own evaluation. This is the main finding of StereoTales, a new multilingual LLM bias detection framework released by Giskard's R&D team.

For a more technical deep-dive, refer to the research blog article here.

Current LLM bias benchmarks fall short

Standard LLM bias benchmarks (BBQ, StereoSet, CrowS-Pairs) test recognition: can the model identify a stereotype when directly asked? That's a different cognitive task than open-ended generation.

In StereoTales, models were prompted to write short stories (~200 words) featuring a protagonist defined by a single demographic attribute: "a non-binary person," "a person with low income," "a person from North Africa." Everything else about the protagonist emerged from the model's own associations.

The pipeline covered 79 attribute values across 19 demographic dimensions (gender, religion, income, immigration status, and more), generating over 650,000 stories from 23 leading models across providers including Anthropic, OpenAI, Google, Mistral, Alibaba, and xAI. Stories were generated in 10 languages: English, French, Spanish, Italian, Portuguese, Dutch, Ukrainian, Arabic, Hindi, and Chinese.

Full story explorer here.

Main findings

Across 650,000 stories, 23 models, and 10 languages, StereoTales surfaced three main findings:

Every frontier LLM produces harmful stereotypes in free-form generation, regardless of model size, provider, or capability tier.
Models know which associations are harmful, but they generate them anyway. LLMs asked to rate the same associations they produced showed meaningful agreement with human raters, yet still generated content they themselves would flag as harmful. Recognition and generation are misaligned.
English-only bias testing is incomplete. Harmful stereotypes are largely language-specific: they adapt to the cultural context of the prompt language, amplifying biases against locally marginalized groups.

Finding 1: No LLM is free of stereotypes

The headline result leaves no room for nuance: all 23 models produced harmful stereotypes in open-ended generation, including the largest, highest-capability models. Even the least biased models in the set produced 20–30 harmful associations.

Some of the most widespread harmful associations (shared by all 23 models) include:

Low education → trades and manual labor
Non-binary gender → arts and creative industries
Low income → basic education
High income → Jewish religion

Stereotales - Association explorer, harmful associations

Full association explorer here.

Finding 2: Models know what's harmful, but still generates stereotypes

StereoTales asked the same 23 LLMs to rate each association for harmfulness, the same task given to human raters (see methodology in below section). The results expose a misalignment.

Human and model ratings correlate reasonably well overall. But the disagreement comes from:

LLMs underestimate harm on socioeconomic attributes: age, income, employment, education, political orientation, religion, urbanicity, immigration status.
LLMs overestimate harm on gender, gender alignment, and geographic origin, the axes that have received the most regulatory and public attention.

The alignment recipes have trained models to be hypersensitive to historically high-profile bias axes while leaving them relatively blind to the breadth of socioeconomic stereotyping.

More critically, every model generates associations it itself classifies as harmful. The generative and discriminative blind spots aligns: the attributes models most underestimate as harmful are also the ones where they produce the most stereotyped associations.

Finding 3: Harmful stereotypes in LLMs are language-specific

Harmful associations are not simply translated from an English-dominant training of models. LLMs adapt culturally to the prompt language, amplifying biases against locally salient protected groups.

Key findings from the per-language analysis:

Harmful biases are language-specific. Harmful associations tend to appear in only 1–2 languages. An English-only fairness evaluation will miss most of the harm a model produces when prompted in other languages.
Languages share biases with their cultural neighbours. French, Italian, and Dutch models produced heavily overlapping stereotypical content. So did Spanish and Portuguese. Languages with shared geography and cultural history inherit shared blind spots.
Models shift their bias targets depending on the prompt language. When prompted in Arabic, models generated fewer harmful associations targeting Muslims (the majority group in Arabic-speaking regions) and more targeting Christians. The pattern held consistently: switching into a language tends to reduce stereotypes about that culture's dominant group and amplify them against its marginalized ones.

We describe this behavior as LLMs acting as "cultural chameleons", absorbing the bias most salient in the training corpus associated with the prompt language.

Methodology

Each generated story was automatically analyzed to extract the full demographic profile of its protagonist (age, gender, income, religion, employment status, and more). This extraction was performed by an ensemble of three LLMs.

From there, statistical tests identified which demographic associations appeared more often than chance across the full story set, for example, whether low-income protagonists were often cast as less educated. Only associations with a large, consistent over-representation were kept.

Finally, to determine which associations were actually harmful, the team recruited 247 independent human raters who scored each association on a 1–5 harmfulness scale (each association was annotated by ~7 raters on average). An association was classified as harmful only if it scored ≥ 4, a conservative threshold, yielding 118 harmful associations out of 1,580 statistically significant ones.

Next steps

This is an ongoing research effort, and we're looking to expand StereoTales to more languages. If you're a native speaker interested in contributing to the framework (translating prompts, validating attribute values, or advising on cultural context) we'd love to hear from you. Reach out at [email protected].

Resources

The full StereoTales dataset, pipeline, and preprint are publicly available:

Dataset: huggingface.co/datasets/giskardai/StereoTales
Source Code: github.com/Giskard-AI/stereotales-pipeline
Preprint: arxiv.org/abs/2605.10442
Research blog article: https://research.giskard.ai/blog/stereotales/

StereoTales was authored by Pierre Le Jeune, Etienne Duchesne, Stefano Palminteri, Weixuan Xiao, Bazire Houssin, Benoît Malézieux, and Matteo Dora. Preprint released May 22, 2026.

Every frontier LLM generates harmful stereotypes in open-ended generation

Current LLM bias benchmarks fall short

Main findings

Finding 1: No LLM is free of stereotypes

Finding 2: Models know what's harmful, but still generates stereotypes

Finding 3: Harmful stereotypes in LLMs are language-specific

Methodology

Next steps

Resources

You will also like

Phare LLM benchmark V2: Reasoning models don't guarantee better security

Who judges the LLM-as-a-Judge? Meta-Evaluation of an LLM vulnerability scanner

Claude Mythos: Analyzing Anthropic’s new frontier model for AI security

Get AI security insights in your inbox

Every frontier LLM generates harmful stereotypes in open-ended generation

Current LLM bias benchmarks fall short

Main findings

Finding 1: No LLM is free of stereotypes

Finding 2: Models know what's harmful, but still generates stereotypes

Finding 3: Harmful stereotypes in LLMs are language-specific

Methodology

Next steps

Resources

You will also like

Phare LLM benchmark V2: Reasoning models don't guarantee better security

Who judges the LLM-as-a-Judge? Meta-Evaluation of an LLM vulnerability scanner

Claude Mythos: Analyzing Anthropic’s new frontier model for AI security

Get AI security insights in your inbox

No vulnerabilities found? We refund the assessment.

No vulnerabilities found? 
We refund the assessment.