CL | AI | Aug 1, 2025

Do They Understand Them? An Updated Evaluation on Nonbinary Pronoun Handling in Large Language Models

arXiv:2508.00788v1 | 1 citation | h-index: 3 | AI
Originality: Synthesis-oriented
AI Analysis

This paper addresses fairness and inclusivity in AI for marginalized groups, but its contribution is incremental: it updates an existing benchmark with newer models.

The researchers tackled the problem of evaluating how well large language models handle nonbinary pronouns by introducing MISGENDERED+, an updated benchmark, and found that while accuracy improved for binary and gender-neutral pronouns compared to prior studies, performance on neopronouns and reverse inference tasks remained inconsistent.

Large language models (LLMs) are increasingly deployed in sensitive contexts where fairness and inclusivity are critical. Pronoun usage, especially concerning gender-neutral and neopronouns, remains a key challenge for responsible AI. Prior work, such as the MISGENDERED benchmark, revealed significant limitations in earlier LLMs' handling of inclusive pronouns, but was constrained to outdated models and limited evaluations. In this study, we introduce MISGENDERED+, an extended and updated benchmark for evaluating LLMs' pronoun fidelity. We benchmark five representative LLMs (GPT-4o, Claude 4, DeepSeek-V3, Qwen Turbo, and Qwen2.5) across zero-shot, few-shot, and gender identity inference settings. Our results show notable improvements compared with previous studies, especially in binary and gender-neutral pronoun accuracy. However, accuracy on neopronouns and reverse inference tasks remains inconsistent, underscoring persistent gaps in identity-sensitive reasoning. We discuss implications, model-specific observations, and avenues for future inclusive AI research.

