CL · Jul 2, 2025

McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models

arXiv:2507.02088v2 · 4 citations · h-index: 6 · ACL
Originality: Synthesis-oriented
AI Analysis

This paper addresses cultural and linguistic gaps in bias evaluation for LLMs, particularly for Chinese-language applications. The contribution is incremental: it extends existing bias evaluation frameworks to a new domain.

The authors tackled the lack of bias evaluation benchmarks for large language models in Chinese contexts by creating McBE, a multi-task dataset with 4,077 instances covering 12 bias categories and 5 evaluation tasks, and found that popular LLMs exhibited varying degrees of bias.

As large language models (LLMs) are increasingly applied to various NLP tasks, their inherent biases are gradually being revealed. Measuring biases in LLMs is therefore crucial to mitigating their ethical risks. However, most existing bias evaluation datasets focus on English and North American culture, and their bias categories are not fully applicable to other cultures. Datasets grounded in the Chinese language and culture are scarce. More importantly, these datasets usually support only a single evaluation task and cannot evaluate bias in LLMs from multiple aspects. To address these issues, we present a Multi-task Chinese Bias Evaluation Benchmark (McBE) that includes 4,077 bias evaluation instances, covers 12 single bias categories and 82 subcategories, and introduces 5 evaluation tasks, providing extensive category coverage, content diversity, and comprehensive measurement. Additionally, we evaluate several popular LLMs from different series and with different parameter sizes. In general, all of these LLMs demonstrate varying degrees of bias. We conduct an in-depth analysis of the results, offering novel insights into bias in LLMs.

