AI | CL | LG | Sep 8, 2025

Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet

arXiv:2509.06861v1 | 8 citations | h-index: 12 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work examines test-time scaling on knowledge-intensive tasks and highlights its current limitations for users who need high factual accuracy and low hallucination rates. Its contribution is incremental: it evaluates existing methods on new benchmarks.

The paper evaluates test-time scaling in reasoning models on knowledge-intensive tasks. It finds that more inference-time computation does not consistently improve accuracy and in many cases increases hallucinations. Where hallucinations do decrease, the cause is often abstention rather than better factual recall.

Test-time scaling increases inference-time computation by allowing models to generate long reasoning chains, and has shown strong performance across many domains. However, in this work, we show that this approach is not yet effective for knowledge-intensive tasks, where high factual accuracy and low hallucination rates are essential. We conduct a comprehensive evaluation of test-time scaling using 12 reasoning models on two knowledge-intensive benchmarks. Our results reveal that increasing test-time computation does not consistently improve accuracy and, in many cases, it even leads to more hallucinations. We then analyze how extended reasoning affects hallucination behavior. We find that reduced hallucinations often result from the model choosing to abstain after thinking more, rather than from improved factual recall. Conversely, for some models, longer reasoning encourages attempts on previously unanswered questions, many of which result in hallucinations. Case studies show that extended reasoning can induce confirmation bias, leading to overconfident hallucinations. Despite these limitations, we observe that compared to non-thinking, enabling thinking remains beneficial. Code and data are available at https://github.com/XuZhao0/tts-knowledge
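The distinction the abstract draws between fewer hallucinations and better factual recall can be illustrated with a small sketch. Everything here is an assumption made for illustration: the three grading labels, the function names `summarize` and `transitions`, and the toy data are hypothetical and are not taken from the paper or its repository. The sketch computes per-budget rates and then counts how each question's label changes between a low and a high reasoning budget.

```python
from collections import Counter

# Hypothetical grading labels (an assumption, not the paper's actual scheme).
CORRECT, INCORRECT, ABSTAIN = "correct", "incorrect", "not_attempted"


def summarize(labels):
    """Return accuracy, hallucination, and abstention rates over all questions."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    n = len(labels)
    return {
        "accuracy": counts[CORRECT] / n,
        "hallucination": counts[INCORRECT] / n,
        "abstention": counts[ABSTAIN] / n,
    }


def transitions(low, high):
    """Count (label at low budget, label at high budget) pairs per question."""
    if len(low) != len(high):
        raise ValueError("both runs must cover the same questions")
    return Counter(zip(low, high))


# Toy data: the same five questions graded at two reasoning budgets.
low = [CORRECT, INCORRECT, INCORRECT, ABSTAIN, INCORRECT]
high = [CORRECT, ABSTAIN, INCORRECT, INCORRECT, ABSTAIN]

low_stats = summarize(low)
high_stats = summarize(high)
moves = transitions(low, high)

print(low_stats)
print(high_stats)
print(moves)
```

In this toy data the hallucination rate falls from 0.6 to 0.4 while accuracy stays at 0.2. The transition counts show why: two previously wrong answers became abstentions, none became correct, and one previously unanswered question turned into a new wrong answer. Looking only at the aggregate hallucination rate would hide both effects.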

Code Implementations: 1 repo
