The Impact of Unstated Norms in Bias Analysis of Language Models
This work addresses a methodological issue in bias analysis that is relevant to AI researchers: common ways of quantifying bias can produce misleading results. Its contribution is an incremental refinement of evaluation techniques.
The paper tackles the problem of unrealistic bias measurements in language models caused by template-based probes. It finds that these probes can artificially inflate measured bias because they conflict with unstated norms, such as markedness, in the pretraining data.
Bias in large language models (LLMs) takes many forms, from overt discrimination to implicit stereotypes. Counterfactual bias evaluation is a widely used approach to quantifying bias and often relies on template-based probes that explicitly state group membership. It measures whether the outcome of a task performed by an LLM is invariant to a change in group membership. In this work, we find that template-based probes can lead to unrealistic bias measurements. For example, LLMs appear to mistakenly cast text associated with the White race as negative at higher rates than text associated with other groups. We hypothesize that this arises artificially via a mismatch between commonly unstated norms, in the form of markedness, in the pretraining text of LLMs (e.g., Black president vs. president) and the templates used for bias measurement (e.g., Black president vs. White president). The findings highlight the potentially misleading impact of varying group membership through explicit mention in counterfactual bias quantification.
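To make the setup concrete, the following is a minimal sketch of a template-based counterfactual probe, not the paper's actual code. The templates, group list, function names, and the toy classifier standing in for an LLM are all hypothetical and chosen only for illustration. It builds one probe per (template, group) pair, plus an unmarked variant that omits the group word, and reports how often each group's probes are labeled negative.

```python
from typing import Callable, Dict, List

# Hypothetical templates; "{group}" marks where group membership is stated explicitly.
TEMPLATES: List[str] = [
    "The {group} president gave a speech.",
    "My {group} neighbor baked a cake.",
    "A {group} doctor answered my questions.",
]

# Hypothetical group terms used for the counterfactual substitution.
GROUPS: List[str] = ["Black", "White", "Asian"]


def fill_templates(templates: List[str], groups: List[str]) -> Dict[str, List[str]]:
    """Build one probe per (template, group) pair, keyed by group."""
    return {g: [t.format(group=g) for t in templates] for g in groups}


def unmarked_probes(templates: List[str]) -> List[str]:
    """Drop the group slot entirely, e.g. 'The president gave a speech.'"""
    return [" ".join(t.replace("{group}", "").split()) for t in templates]


def negative_rate(
    probes: Dict[str, List[str]], classify: Callable[[str], str]
) -> Dict[str, float]:
    """Fraction of each group's probes that the classifier labels 'negative'."""
    rates = {}
    for group, texts in probes.items():
        labels = [classify(text) for text in texts]
        rates[group] = labels.count("negative") / len(labels)
    return rates


def max_gap(rates: Dict[str, float]) -> float:
    """Largest difference in negative rate between any two groups."""
    return max(rates.values()) - min(rates.values())


def toy_classifier(text: str) -> str:
    """Illustrative stand-in for an LLM sentiment classifier.

    It ignores group words, so its output is invariant to group by construction.
    """
    return "negative" if "complained" in text else "neutral"
```

Under this sketch, a gap of zero means the outcome is invariant to the stated group. Comparing the marked probes against the output of `unmarked_probes` is one way to see the mismatch the abstract describes: pretraining text often leaves the default group unstated, while the templates state every group explicitly.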