CL | Feb 19, 2025

Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

arXiv:2502.13962v2 | 21 citations | h-index: 15 | ACL
Originality: Incremental advance
AI Analysis

This work addresses when large language models should answer a question and when they should withhold a response, which matters for applications requiring reliable and safe AI systems. The contribution is incremental: it extends existing test-time scaling methods.

The paper tackles the problem of selective question answering by using test-time compute scaling to improve both accuracy and confidence in model responses, and proposes a new evaluation framework for non-zero risk settings.

Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.
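The thresholding and non-zero-risk evaluation described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the `Response` record, the function names, the sample data, and the utility scheme (+1 for a correct answer, 0 for abstaining, minus `wrong_penalty` for an incorrect answer).

```python
from dataclasses import dataclass


@dataclass
class Response:
    confidence: float  # hypothetical confidence score in [0, 1]
    correct: bool      # whether the answer matches the reference


def coverage(responses, threshold):
    """Fraction of questions the system chooses to answer."""
    answered = sum(r.confidence >= threshold for r in responses)
    return answered / len(responses)


def selective_score(responses, threshold, wrong_penalty):
    """Mean utility per question when low-confidence answers are withheld.

    Assumed utility: +1 correct, 0 abstain, -wrong_penalty incorrect.
    """
    total = 0.0
    for r in responses:
        if r.confidence < threshold:
            continue  # abstain: contributes 0
        total += 1.0 if r.correct else -wrong_penalty
    return total / len(responses)


# Illustrative data only; not results from the paper.
sample = [
    Response(0.9, True),
    Response(0.8, False),
    Response(0.4, True),
    Response(0.2, False),
]

always_answer = selective_score(sample, threshold=0.0, wrong_penalty=0.0)
risky_mid = selective_score(sample, threshold=0.5, wrong_penalty=1.0)
risky_strict = selective_score(sample, threshold=0.85, wrong_penalty=1.0)
```

With `wrong_penalty=0.0` and `threshold=0.0`, the score reduces to plain accuracy, which corresponds to the always-answer setting the abstract criticizes. Raising the penalty makes abstention worthwhile, so a score is only meaningful when reported together with the threshold and penalty used.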

Code Implementations: 1 repo