CL Sep 27, 2025

From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs

Haonan Wang, Weida Liang, Zihang Fu, Nie Zheng, Yifan Zhang, Yao Tong, Tongyao Zhu, Hao Jiang, Chuang Li, Jiaying Wu, Kenji Kawaguchi

arXiv:2509.23196v1, 4.91 citations, h-index: 6

Originality: Incremental advance

AI Analysis

This work addresses a key bottleneck in improving the reasoning performance of LLMs and offers a practical solution for researchers and practitioners, though the advance is incremental because it builds on existing demonstration methods.

The paper tackles the problem that reasoning LLMs often perform worse with few-shot chain-of-thought demonstrations than with direct answering, and identifies semantic misguidance and strategy transfer failure as the causes. It introduces Insight-to-Solve (I2S), a test-time procedure that converts demonstrations into reusable insights, yielding consistent accuracy gains, such as +14.0% for GPT-4.1 on AIME'25.

Recent reasoning LLMs (RLMs), especially those trained with verifier-based reinforcement learning, often perform worse with few-shot CoT than with direct answering. We revisit this paradox using high-quality reasoning traces from DeepSeek-R1 as demonstrations and find that adding more exemplars consistently degrades accuracy, even when demonstrations are optimal. A detailed analysis reveals two mechanisms behind this decline: (i) semantic misguidance, where high textual similarity leads the model to treat the target as the same as the exemplar and to copy intermediate steps verbatim; and (ii) strategy transfer failure, where the model struggles to extract useful reasoning strategies and apply them to target questions. Guided by these findings, we introduce Insight-to-Solve (I2S), a sequential test-time procedure that turns demonstrations into explicit, reusable insights and derives a target-specific reasoning trace; optionally, the reasoning is self-refined for coherence and correctness (I2S+). Extensive experiments on diverse benchmarks show that I2S and I2S+ consistently outperform both direct answering and test-time scaling baselines across open- and closed-source models. Even for GPT models, our method helps: on AIME'25, GPT-4.1 rises by +14.0%, and o1-mini improves by +2.7% on AIME and +1.7% on GPQA, indicating that in-context demonstrations can be harnessed effectively via an insight-refine-solve framework.
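The sequential procedure described in the abstract (demonstrations to insights, insights to a target-specific trace, optional self-refinement) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `insight_to_solve`, the `llm` callable (prompt in, completion out), and all prompt wording are hypothetical. The choice to pass only the extracted insights, and not the raw traces, to the solving step is likewise an assumption, motivated by the verbatim-copying failure mode the abstract mentions.

```python
from typing import Callable, List, Tuple


def insight_to_solve(
    llm: Callable[[str], str],      # hypothetical: any prompt -> completion function
    demos: List[Tuple[str, str]],   # (question, reasoning trace) pairs
    target: str,
    refine: bool = False,           # True approximates the optional self-refine step
) -> str:
    # Step 1 (assumed prompt): distill each demonstration into a reusable insight.
    insights = []
    for question, trace in demos:
        prompt = (
            "Summarize the general strategy used below as one short, reusable "
            "insight. Do not restate problem-specific numbers.\n\n"
            f"Question: {question}\nReasoning: {trace}"
        )
        insights.append(llm(prompt).strip())

    # Step 2 (assumed prompt): solve the target using the insights only,
    # so the raw traces are never shown alongside the target question.
    bullet_list = "\n".join(f"- {text}" for text in insights)
    solution = llm(
        "Use these insights only where they apply.\n"
        f"{bullet_list}\n\nTarget question: {target}\nSolve step by step."
    )

    # Step 3 (optional, assumed prompt): one self-refinement pass.
    if refine:
        solution = llm(
            "Check the reasoning below for coherence and correctness, then "
            "return a corrected version.\n\n"
            f"Target question: {target}\nReasoning: {solution}"
        )
    return solution
```

With `n` demonstrations this sketch makes `n + 1` model calls, or `n + 2` when `refine=True`; the paper itself should be consulted for the actual prompts and call structure.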

