BERTScore is Unfair: On Social Bias in Language Model-Based Metrics for Text Generation
This work addresses fairness concerns in automatic evaluation metrics for generative systems, which matters to developers and users who want equitable AI applications. The contribution is incremental, however, since it builds on known biases in pre-trained language models (PLMs).
The study systematically examined social bias in pre-trained language model-based metrics for text generation. It found that popular metrics such as BERTScore exhibit significantly higher bias than traditional metrics across six sensitive attributes, and it developed debiasing adapters that mitigate this bias while maintaining evaluation performance.
Automatic evaluation metrics are crucial to the development of generative systems. In recent years, pre-trained language model (PLM) based metrics, such as BERTScore, have been commonly adopted in various generation tasks. However, PLMs have been shown to encode a range of stereotypical societal biases, which raises concerns about the fairness of PLMs as metrics. To that end, this work presents the first systematic study of social bias in PLM-based metrics. We demonstrate that popular PLM-based metrics exhibit significantly higher social bias than traditional metrics on 6 sensitive attributes, namely race, gender, religion, physical appearance, age, and socioeconomic status. In-depth analysis suggests that the choice of metric paradigm (matching, regression, or generation) has a greater impact on fairness than the choice of PLM. In addition, we develop debiasing adapters that are injected into PLM layers, mitigating bias in PLM-based metrics while retaining high performance for evaluating text generation.
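
To make the idea of bias in a metric concrete, the sketch below shows one possible way to probe a scoring function: score two candidates that differ only in a sensitive-attribute term against the same neutral reference, and average the absolute score gap. This is an illustrative assumption, not necessarily the measurement protocol used in this work. The names `unigram_f1`, `bias_gap`, and `biased_metric`, as well as the example sentences, are hypothetical, and the toy overlap metric merely stands in for whichever metric is under test.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

Metric = Callable[[str, str], float]


def unigram_f1(candidate: str, reference: str) -> float:
    """Toy token-overlap F1 (hypothetical stand-in for a traditional metric)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def bias_gap(metric: Metric, triples: Sequence[Tuple[str, str, str]]) -> float:
    """Mean absolute score gap over (reference, candidate_a, candidate_b) triples.

    Assumption: candidate_a and candidate_b differ only in a sensitive-attribute
    term, so a metric that ignores that attribute should yield a gap near zero.
    """
    gaps = [abs(metric(a, ref) - metric(b, ref)) for ref, a, b in triples]
    return sum(gaps) / len(gaps)


# Hypothetical probe set: neutral references, gender-swapped candidates.
triples = [
    ("the doctor finished the shift", "he is a doctor", "she is a doctor"),
    ("the nurse checked the chart", "he works as a nurse", "she works as a nurse"),
]


def biased_metric(candidate: str, reference: str) -> float:
    """Deliberately skewed metric, used only to show a non-zero gap."""
    bonus = 0.1 if "he" in candidate.lower().split() else 0.0
    return unigram_f1(candidate, reference) + bonus


fair_gap = bias_gap(unigram_f1, triples)
skewed_gap = bias_gap(biased_metric, triples)
print(f"overlap metric gap: {fair_gap:.3f}")
print(f"skewed metric gap:  {skewed_gap:.3f}")
```

Any scoring function with the same `(candidate, reference) -> float` signature, including a PLM-based one, could be passed to `bias_gap` in place of the toy metric, which would allow metrics to be compared on the same probe set.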