AI | CL | LG | Mar 28, 2025

QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?

Meta AI, MIT
arXiv:2503.22674v2 | 39 citations | h-index: 14
Originality: Incremental advance
AI Analysis

The paper addresses the challenge that real-world queries to AI systems are often underspecified, and it identifies information acquisition as a specific bottleneck.

The paper tackles how LLMs acquire missing information in underspecified reasoning tasks. It introduces QuestBench, a benchmark that evaluates whether a model can ask the minimal necessary question, and finds that models reach only 40-50% accuracy on the logic and planning tasks despite excelling on the math tasks.

Large language models (LLMs) have shown impressive performance on reasoning benchmarks like math and logic. While many works have largely assumed well-defined tasks, real-world queries are often underspecified and only solvable by acquiring missing information. We formalize this information-gathering problem as a constraint satisfaction problem (CSP) with missing variable assignments. Using a special case where only one necessary variable assignment is missing, we can evaluate an LLM's ability to identify the minimal necessary question to ask. We present QuestBench, a set of underspecified reasoning tasks solvable by asking at most one question, which includes: (1) Logic-Q: logical reasoning tasks with one missing proposition, (2) Planning-Q: PDDL planning problems with partially-observed initial states, (3) GSM-Q: human-annotated grade school math problems with one unknown variable, and (4) GSME-Q: equation-based version of GSM-Q. The LLM must select the correct clarification question from multiple options. While current models excel at GSM-Q and GSME-Q, they achieve only 40-50% accuracy on Logic-Q and Planning-Q. Analysis shows that the ability to solve well-specified reasoning problems is not sufficient for success on our benchmark: models struggle to identify the right question even when they can solve the fully specified version. This highlights the need for specifically optimizing models' information acquisition capabilities.
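
The CSP framing in the abstract can be illustrated with a toy example. The sketch below is not the paper's construction or code: it assumes Boolean variables, brute-force enumeration, and hypothetical names (`consistent_assignments`, `target_values`, `sufficient_questions`). A variable counts as a sufficient question if, whichever value the answer takes, the target is determined in every assignment that remains consistent with the constraints.

```python
from itertools import product


def consistent_assignments(variables, constraints, known):
    """Yield every full Boolean assignment that extends `known`
    and satisfies all constraints (brute force, toy sizes only)."""
    free = [v for v in variables if v not in known]
    for values in product([False, True], repeat=len(free)):
        assignment = {**known, **dict(zip(free, values))}
        if all(check(assignment) for check in constraints):
            yield assignment


def target_values(variables, constraints, known, target):
    """Set of values the target takes across all consistent assignments."""
    return {a[target] for a in consistent_assignments(variables, constraints, known)}


def sufficient_questions(variables, constraints, known, target):
    """Unknown variables whose value alone would determine the target."""
    if len(target_values(variables, constraints, known, target)) <= 1:
        return []  # target already determined, no question needed
    questions = []
    for var in variables:
        if var in known or var == target:
            continue
        # Asking about `var` suffices if every possible answer leaves
        # at most one value for the target.
        if all(
            len(target_values(variables, constraints, {**known, var: answer}, target)) <= 1
            for answer in (False, True)
        ):
            questions.append(var)
    return questions


# Hypothetical instance: d holds exactly when a and b both hold;
# c and e are distractors that do not constrain d.
variables = ["a", "b", "c", "d", "e"]
constraints = [lambda s: s["d"] == (s["a"] and s["b"])]
known = {"a": True}

print(sufficient_questions(variables, constraints, known, target="d"))
```

In this instance only `b` qualifies: once `a` is known to be true, `d` takes the same value as `b`, while learning `c` or `e` leaves `d` undetermined. The benchmark's multiple-choice setup corresponds to picking that variable from among the distractors.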

Code Implementations: 1 repo
