CL, Feb 12

Who is the richest club in the championship? Detecting and Rewriting Underspecified Questions Improve QA Performance

arXiv:2602.11938v4, 1 citation, h-index: 15
Originality: Incremental advance
AI Analysis

This paper addresses a confound in QA evaluation that matters to researchers and benchmark designers, though the advance is incremental because it builds on existing QA and LLM methods.

The paper tackles underspecified questions in QA benchmarks. It finds that 16% to over 50% of benchmark questions are underspecified and that LLMs perform significantly worse on them. Rewriting these questions into fully specified variants, with gold answers held fixed, consistently improves QA performance, indicating that many apparent failures stem from question underspecification rather than model limitations.

Large language models (LLMs) perform well on well-posed questions, yet standard question-answering (QA) benchmarks remain far from solved. We argue that this gap is partly due to underspecified questions - queries whose interpretation cannot be uniquely determined without additional context. To test this hypothesis, we introduce an LLM-based classifier to identify underspecified questions and apply it to several widely used QA datasets, finding that 16% to over 50% of benchmark questions are underspecified and that LLMs perform significantly worse on them. To isolate the effect of underspecification, we conduct a controlled rewriting experiment that serves as an upper-bound analysis, rewriting underspecified questions into fully specified variants while holding gold answers fixed. QA performance consistently improves under this setting, indicating that many apparent QA failures stem from question underspecification rather than model limitations. Our findings highlight underspecification as an important confound in QA evaluation and motivate greater attention to question clarity in benchmark design.

