CL AI HC LG · Sep 7, 2025

Benchmarking Gender and Political Bias in Large Language Models

Jinrui Yang, Xudong Han, Timothy Baldwin

arXiv:2509.06164v2 · 2.7 · h-index: 15

Originality: Incremental advance

AI Analysis

This work addresses fairness and accountability issues in NLP for political applications. It is an incremental advance: it builds on existing bias evaluation methods and contributes a new dataset.

The authors address bias evaluation in large language models (LLMs) by introducing EuroParlVote, a benchmark that links European Parliament speeches to votes and includes demographic metadata. They find that LLMs frequently misclassify female MEPs as male and are less accurate for female speakers, and that proprietary models such as GPT-4o outperform open-weight ones in robustness and fairness.

We introduce EuroParlVote, a novel benchmark for evaluating large language models (LLMs) in politically sensitive contexts. It links European Parliament debate speeches to roll-call vote outcomes and includes rich demographic metadata for each Member of the European Parliament (MEP), such as gender, age, country, and political group. Using EuroParlVote, we evaluate state-of-the-art LLMs on two tasks -- gender classification and vote prediction -- revealing consistent patterns of bias. We find that LLMs frequently misclassify female MEPs as male and demonstrate reduced accuracy when simulating votes for female speakers. Politically, LLMs tend to favor centrist groups while underperforming on both far-left and far-right ones. Proprietary models like GPT-4o outperform open-weight alternatives in terms of both robustness and fairness. We release the EuroParlVote dataset, code, and demo to support future research on fairness and accountability in NLP within political contexts.
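
The group-level comparison described in the abstract (accuracy for female versus male speakers, or across political groups) can be illustrated with a minimal sketch. This is not the authors' released code: the record fields (`gender`, `label`, `prediction`), the helper `accuracy_by_group`, and the toy records are all hypothetical and chosen only to show how a per-group accuracy gap might be computed.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key):
    """Return {group: accuracy} over records that carry 'label' and 'prediction'.

    Hypothetical helper for illustration; field names are assumptions,
    not the EuroParlVote schema.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[group_key]
        total[group] += 1
        correct[group] += int(record["prediction"] == record["label"])
    return {group: correct[group] / total[group] for group in total}


# Made-up toy records, not data from the benchmark.
records = [
    {"gender": "female", "label": "for", "prediction": "against"},
    {"gender": "female", "label": "for", "prediction": "for"},
    {"gender": "male", "label": "against", "prediction": "against"},
    {"gender": "male", "label": "for", "prediction": "for"},
]

acc = accuracy_by_group(records, "gender")
# A positive gap means the model is more accurate for male speakers.
gap = acc["male"] - acc["female"]
```

The same helper could be called with a different `group_key` (for example a political-group field) to compare accuracy across groups, assuming the records carry that field.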
