CL AI | Mar 10, 2025

Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations

Jiho Jin, Woosung Kang, Junho Myung, Alice Oh

arXiv:2503.06987v2 | 9 citations | h-index: 10 | ACL

Originality: Synthesis-oriented

AI Analysis

This work helps researchers and developers of large language models evaluate social bias in long-form text generation, though its contribution is incremental because it adapts an existing benchmark.

The authors propose a Bias Benchmark for Generation (BBG), adapted from a QA benchmark, to measure social bias in long-form generation by large language models. Across ten LLMs in English and Korean, they found that generation-based and QA-based evaluations produce inconsistent results.

Measuring social bias in large language models (LLMs) is crucial, but existing bias evaluation methods struggle to assess bias in long-form generation. We propose a Bias Benchmark for Generation (BBG), an adaptation of the Bias Benchmark for QA (BBQ), designed to evaluate social bias in long-form generation by having LLMs generate continuations of story prompts. Building our benchmark in English and Korean, we measure the probability of neutral and biased generations across ten LLMs. We also compare our long-form story generation evaluation results with multiple-choice BBQ evaluation, showing that the two approaches produce inconsistent results.
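As a rough illustration of what "measuring the probability of neutral and biased generations" could look like in practice, the sketch below computes the empirical share of each label among judged continuations of a story prompt. It is not the paper's metric or code: the label set ("neutral", "biased", "other"), the function name `generation_rates`, and the sample judgments are all assumptions made for illustration, and how each continuation is judged is left out entirely.

```python
from collections import Counter


def generation_rates(labels):
    """Return the empirical share of each label among judged continuations.

    `labels` is a list of strings, one per generated continuation.
    The label vocabulary is hypothetical, not taken from the paper.
    """
    if not labels:
        raise ValueError("need at least one judged continuation")
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical judgments for eight sampled continuations of one story prompt.
judged = [
    "neutral", "biased", "neutral", "neutral",
    "other", "biased", "neutral", "neutral",
]
rates = generation_rates(judged)
print(rates)
```

Averaging such per-prompt rates over a benchmark would give one number per model and language, which could then be set beside multiple-choice accuracy on the same contexts to check whether the two evaluations agree.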
