CL, Oct 8, 2021

Evaluation of Summarization Systems across Gender, Age, and Race

arXiv:2110.04384v1

AI Analysis

This paper addresses fairness and bias in NLP evaluation, a critical but often overlooked problem in model development that is relevant to both researchers and practitioners.

The study investigates how the demographics of human annotators bias the evaluation of summarization systems. It shows that evaluations are sensitive to protected attributes such as gender, age, and race, which can lead to models that favor some groups over others.

Summarization systems are ultimately evaluated by human annotators and raters. Usually, annotators and raters do not reflect the demographics of end users, but are recruited through student populations or crowdsourcing platforms with skewed demographics. For two different evaluation scenarios -- evaluation against gold summaries and system output ratings -- we show that summary evaluation is sensitive to protected attributes. This can severely bias system development and evaluation, leading us to build models that cater for some groups rather than others.
