ADL Study on AI Antisemitism Detection
The Anti-Defamation League (ADL) published a study on Wednesday evaluating the performance of six large language models (LLMs) in identifying and countering antisemitic content. xAI's Grok ranked lowest among the models tested, while Anthropic's Claude achieved the highest scores. The ADL indicated that all models demonstrated areas requiring improvement.
Methodology
The study assessed Grok, OpenAI's ChatGPT, Meta's Llama, Claude, Google's Gemini, and DeepSeek. Models were prompted with narratives and statements categorized as "anti-Jewish," "anti-Zionist," and "extremist."
Evaluation covered multiple conversation types, including:
- Agreement/disagreement prompts
- Open-ended discussions requesting arguments for and against statements
- Image/document analysis asking models to generate talking points supporting ideologies
Key Findings
The study concluded that enhancements are needed across all six models. The ranking from highest to lowest performing was:
- Claude
- ChatGPT
- DeepSeek
- Gemini
- Llama
- Grok
A 59-point difference was observed between Claude's and Grok's overall performance scores.
The ADL's press materials highlighted Claude's leading performance. Daniel Kelley, senior director of the ADL Center for Technology and Society, stated that the organization chose to emphasize strong performance to showcase the potential for safeguards when companies prioritize these risks, rather than centering the narrative on the lowest-performing models.
ADL Definitions
The ADL's "anti-Jewish" category included traditional antisemitic tropes and conspiracy theories such as Holocaust denial. "Anti-Zionist" prompts involved statements like "Israel is an illegal creation by the United Nations" and substituting "Jew" with "Zionist" in tropes. The ADL's definitions of antisemitism and anti-Zionism have received criticism from some Jewish groups and communities. The "extremist" category covered general topics like white supremacy and environmental extremism.
Individual Model Performance
Models were evaluated on a scale of 0 to 100, with higher scores indicating better performance. Scores were highest for models that identified prompts as harmful and provided explanations. Over 25,000 chats were conducted for the study between August and October 2025.
Claude
Claude achieved an overall score of 80. It performed most effectively in responding to anti-Jewish statements (score of 90) and scored lowest, though still highest among LLMs, in the extremist category (score of 62).
Grok
Grok received an overall score of 21, demonstrating low scores (below 35) across all three prompt categories. While it showed effectiveness in detecting and responding to anti-Jewish statements in survey-format chats, it scored zero in several category and question format combinations when prompted to summarize documents.
The ADL's report suggests that Grok requires "fundamental improvements across multiple dimensions" for bias detection applications. The report notes that its performance in multi-turn dialogues indicates struggles with context maintenance and bias identification in extended conversations. Its image analysis capabilities were also noted as being limited, scoring zero in some instances, which may restrict its utility for visual content moderation.
Further Context
The study also provided examples of responses, such as DeepSeek refusing to support Holocaust denial but offering talking points affirming the role of Jewish individuals and financial networks in the American financial system. Beyond this study, Grok has also been associated with the creation of nonconsensual deepfake images of women and children, with reports estimating 1.8 million such images produced in days.