CL AI CY, Jun 16, 2022

Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models

Maribeth Rauh, John Mellor, Jonathan Uesato, Po-Sen Huang, Johannes Welbl, Laura Weidinger, Sumanth Dathathri, Amelia Glaese, Geoffrey Irving, Iason Gabriel, William Isaac, Lisa Anne Hendricks

arXiv:2206.08325v29.968 citations, h-index: 32

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of rigorous benchmarking for language model harms, which is crucial for developers and researchers to mitigate risks in AI applications, though it is incremental in providing a framework rather than a new method.

The paper tackles the problem of evaluating harmful text generated by large language models by proposing six characteristics to guide benchmark design, and applies them to analyze existing benchmarks and the Perspective API.

Large language models produce human-like text that drives a growing number of applications. However, recent literature and, increasingly, real-world observations have demonstrated that these models can generate language that is toxic, biased, untruthful or otherwise harmful. Though work to evaluate language model harms is under way, translating foresight about which harms may arise into rigorous benchmarks is not straightforward. To facilitate this translation, we outline six ways of characterizing harmful text which merit explicit consideration when designing new benchmarks. We then use these characteristics as a lens to identify trends and gaps in existing benchmarks. Finally, we apply them in a case study of the Perspective API, a toxicity classifier that is widely used in harm benchmarks. Our characteristics provide one piece of the bridge that translates between foresight and effective evaluation.
