COBIAS: Assessing the Contextual Reliability of Bias Benchmarks for Language Models
The work addresses the need for more reliable bias assessment in language models, which is crucial for bias mitigation efforts. The contribution is incremental, since it builds on existing benchmarks.
The paper tackles the problem that statements in existing bias benchmarks for language models lack contextual considerations. It introduces COBIAS, a metric that scores how reliably a biased statement detects bias, based on the variance in model behavior across different contexts, and shows that the metric aligns with human judgment (Spearman's ρ = 0.65).
Large Language Models (LLMs) often inherit biases from the web data they are trained on, which contains stereotypes and prejudices. Current methods for evaluating and mitigating these biases rely on bias-benchmark datasets. These benchmarks measure bias by observing an LLM's behavior on biased statements. However, these statements lack contextual considerations of the situations they try to present. To address this, we introduce a contextual reliability framework, which evaluates model robustness to biased statements by considering the various contexts in which they may appear. We develop the Context-Oriented Bias Indicator and Assessment Score (COBIAS) to measure a biased statement's reliability in detecting bias, based on the variance in model behavior across different contexts. To evaluate the metric, we augment 2,291 stereotyped statements from two existing benchmark datasets by adding contextual information. We show that COBIAS aligns with human judgment on the contextual reliability of biased statements (Spearman's $\rho = 0.65$, $p = 3.4 \times 10^{-60}$) and can be used to create reliable benchmarks, which would assist bias mitigation work.
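The abstract names two quantitative ingredients: the variance of model behavior across contexts, and a Spearman rank correlation against human judgment. The sketch below illustrates both in plain Python. It is not the COBIAS formula, which the abstract does not state. The function names, the use of a single numeric model score per context, and all numbers are hypothetical, and the sketch takes no position on how variance maps to the final reliability score.

```python
from statistics import mean, pvariance


def context_variance(scores):
    """Population variance of one statement's model scores across contexts."""
    return pvariance(scores)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical model scores: one list per statement, one score per context.
scores_per_statement = [
    [0.61, 0.60, 0.62, 0.59],
    [0.20, 0.75, 0.40, 0.90],
    [0.55, 0.45, 0.65, 0.50],
]
# Hypothetical human ratings for the same three statements.
human_ratings = [1.0, 4.5, 2.0]

variances = [context_variance(s) for s in scores_per_statement]
print(variances, spearman_rho(variances, human_ratings))
```

With real data, each inner list would hold the model's scores for the context-augmented versions of one benchmark statement, and the rank correlation would be computed over all statements.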