Mar 22

Enhancing reasoning accuracy in large language models during inference time

arXiv:2603.21301 · 73.5 · h-index: 1
Predicted impact: top 86% in CL · last 90 days · Originality: Synthesis-oriented
AI Analysis

This work addresses the need for more reliable reasoning in LLMs for low- to moderate-risk domains, but it is incremental as it builds on existing techniques like Chain-of-Thought prompting.

The paper tackles unreliable multi-step reasoning in large language models at inference time by evaluating three inference-time strategies. It finds that self-consistency with nucleus sampling and controlled temperature achieves a 9% to 15% absolute accuracy improvement over greedy single-pass decoding.

Large Language Models (LLMs) often exhibit strong linguistic abilities while remaining unreliable on multi-step reasoning tasks, particularly when deployed without additional training or fine-tuning. In this work, we study inference-time techniques to improve the reasoning accuracy of LLMs. We systematically evaluate three classes of inference-time strategies: (i) self-consistency via stochastic decoding, where the model is sampled multiple times using controlled temperature and nucleus sampling and the most frequent final answer is selected; (ii) dual-model reasoning agreement, where outputs from two independent models are compared and only consistent reasoning traces are trusted; and (iii) self-reflection, where the model critiques and revises its own reasoning. Across all evaluated methods, we employ Chain-of-Thought (CoT) [1] prompting to elicit explicit intermediate reasoning steps before generating final answers. We provide a controlled comparative evaluation of the three strategies under identical prompting and verification settings. Our experiments on LLM [2] show that self-consistency with nucleus sampling and a controlled temperature yields substantial gains, achieving a 9% to 15% absolute improvement in accuracy over greedy single-pass decoding. This makes it well suited to low-risk domains, where it offers meaningful gains with minimal overhead. The dual-model approach provides additional confirmation of the model's reasoning steps and is therefore more appropriate for moderate-risk domains, where higher reliability justifies additional compute. Self-reflection offers only marginal improvements, suggesting limited effectiveness for smaller non-reasoning models at inference time.
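
The first two strategies can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `self_consistency` and `dual_model_agreement`, the `sample_answer` callable (standing in for one stochastic CoT generation that returns a final answer), and the exact-match comparison are all assumptions made for illustration. The abstract describes comparing reasoning traces across two models; the sketch simplifies this to comparing final answers only.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistency(sample_answer: Callable[[], str], n_samples: int = 5) -> str:
    """Draw n_samples final answers and return the most frequent one.

    sample_answer is a hypothetical stand-in for one sampled generation
    (e.g. with temperature and nucleus sampling) reduced to its final answer.
    """
    answers = [sample_answer() for _ in range(n_samples)]
    # most_common(1) returns [(answer, count)]; on ties, the answer
    # that was seen first wins.
    return Counter(answers).most_common(1)[0][0]


def dual_model_agreement(answer_a: str, answer_b: str) -> Optional[str]:
    """Return the answer only if both models agree; otherwise abstain (None).

    Simplification: compares final answers by exact match after stripping
    whitespace, not full reasoning traces.
    """
    a, b = answer_a.strip(), answer_b.strip()
    return a if a == b else None


if __name__ == "__main__":
    # Canned answers replace a real model call so the example is deterministic.
    canned = iter(["42", "41", "42", "24", "42"])
    print(self_consistency(lambda: next(canned), n_samples=5))  # prints 42
    print(dual_model_agreement("42", "41"))  # prints None
```

Majority voting costs one generation per sample, whereas the agreement check needs a second model and may abstain. This is consistent with the abstract's framing of the former as low-overhead and the latter as suited to settings where extra compute is justified.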

