CL AI CY | Sep 22, 2025

Evaluating Large Language Models for Detecting Antisemitism

Jay Patel, Hrudayangam Mehta, Jeremy Blackburn

arXiv:2509.18293v2 | 1 citation | h-index: 2 | Has Code | EMNLP

Originality: Incremental advance

AI Analysis

This work addresses the detection of antisemitism in social media content, a problem relevant to moderators and researchers. It is an incremental advance: it evaluates existing LLMs with a new prompting method rather than introducing a fundamental breakthrough.

The study evaluated eight open-source large language models (LLMs) on detecting antisemitic content, using an in-context definition and a new prompting technique called Guided-CoT. Guided-CoT improved performance and utility, and Llama 3.1 70B outperformed fine-tuned GPT-3.5.

Detecting hateful content is a challenging and important problem. Automated tools, like machine-learning models, can help, but they require continuous training to adapt to the ever-changing landscape of social media. In this work, we evaluate eight open-source LLMs' capability to detect antisemitic content, specifically leveraging an in-context definition. We also study how LLMs understand and explain their decisions given a moderation policy as a guideline. First, we explore various prompting techniques and design a new CoT-like prompt, Guided-CoT, and find that injecting domain-specific thoughts increases performance and utility. Guided-CoT handles the in-context policy well, improving performance and utility by reducing refusals across all evaluated models, regardless of decoding configuration, model size, or reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5. Additionally, we examine LLM errors and introduce metrics to quantify semantic divergence in model-generated rationales, revealing notable differences and paradoxical behaviors among LLMs. Our experiments highlight the differences observed across LLMs' utility, explainability, and reliability. Code and resources are available at: https://github.com/idramalab/quantify-llm-explanations
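
To make the idea of a guided, policy-grounded prompt more concrete, here is a minimal sketch of how such a prompt might be assembled and how a refusal rate might be counted. It is not the authors' Guided-CoT prompt or code: the policy placeholder, the three guiding steps, the `Answer: Yes/No` output format, and the function names `build_guided_prompt`, `parse_answer`, and `refusal_rate` are all hypothetical, chosen only for illustration. Any response without a parseable final answer is treated here as a refusal, which is also an assumption.

```python
from typing import Optional, Sequence

# Hypothetical placeholders: the paper's actual definition and guiding
# thoughts are not reproduced here.
POLICY = "<in-context definition of antisemitism goes here>"
GUIDED_STEPS = (
    "Identify who or what the post is about.",
    "Check the post against each part of the definition above.",
    "Consider context such as quoting, reporting, or sarcasm.",
)


def build_guided_prompt(post: str, policy: str = POLICY,
                        steps: Sequence[str] = GUIDED_STEPS) -> str:
    """Combine the policy, the post, and numbered guiding steps into one prompt."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, start=1))
    return (
        f"Definition:\n{policy}\n\n"
        f"Post:\n{post}\n\n"
        f"Think through these steps in order:\n{numbered}\n\n"
        "Finish with a line of the form 'Answer: Yes' or 'Answer: No'."
    )


def parse_answer(response: str) -> Optional[bool]:
    """Return True for Yes, False for No, or None when no usable answer is found."""
    for line in reversed(response.strip().splitlines()):
        cleaned = line.strip().lower()
        if cleaned.startswith("answer:"):
            verdict = cleaned[len("answer:"):].strip().rstrip(".")
            if verdict == "yes":
                return True
            if verdict == "no":
                return False
            return None  # an answer line exists but is not Yes/No
    return None  # no answer line at all (counted as a refusal in this sketch)


def refusal_rate(responses: Sequence[str]) -> float:
    """Fraction of responses with no parseable Yes/No answer."""
    if not responses:
        return 0.0
    refused = sum(1 for r in responses if parse_answer(r) is None)
    return refused / len(responses)
```

In use, the string from `build_guided_prompt` would be sent to whichever model is under evaluation, and the raw responses passed to `refusal_rate` to compare prompting strategies on utility.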
