CL · AI · Mar 10, 2025

Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations

arXiv:2503.06987v2 · 9 citations · h-index: 10 · ACL
Originality: Synthesis-oriented
AI Analysis

This work helps researchers and developers of large language models evaluate social bias in long-form text generation, though it is incremental because it adapts an existing benchmark.

The authors tackled the problem of measuring social bias in long-form generation by large language models, proposing a Bias Benchmark for Generation (BBG) adapted from a QA benchmark, and found inconsistent results between generation and QA-based evaluations across ten LLMs in English and Korean.

Measuring social bias in large language models (LLMs) is crucial, but existing bias evaluation methods struggle to assess bias in long-form generation. We propose a Bias Benchmark for Generation (BBG), an adaptation of the Bias Benchmark for QA (BBQ), designed to evaluate social bias in long-form generation by having LLMs generate continuations of story prompts. Building our benchmark in English and Korean, we measure the probability of neutral and biased generations across ten LLMs. We also compare our long-form story generation evaluation results with multiple-choice BBQ evaluation, showing that the two approaches produce inconsistent results.

