LG, AI | Feb 16

Broken Chains: The Cost of Incomplete Reasoning in LLMs

arXiv:2602.14444v1 | 1 citation | h-index: 3
Originality: Incremental advance
AI Analysis

This paper addresses the cost and efficiency challenges of deploying reasoning-specialized LLMs under resource constraints. It is an incremental advance, building on existing reasoning frameworks.

The study investigates how different reasoning modalities (code, natural language, hybrid, or none) perform under token constraints in large language models. It finds that truncated reasoning can significantly hurt performance, that code degrades more gracefully, and that robustness varies by model, with some models collapsing to as low as 7% accuracy at reduced budgets.

Reasoning-specialized models such as OpenAI's GPT-5.1 and DeepSeek-V3.2 allocate substantial inference compute to extended chain-of-thought (CoT) traces, yet reasoning tokens incur significant costs. How do different reasoning modalities (code, natural language, hybrid, or none) perform under token constraints? We introduce a framework that constrains models to reason exclusively through code, comments, both, or neither, then systematically ablates token budgets to 10%, 30%, 50%, and 70% of optimal. We evaluate four frontier models (GPT-5.1, Gemini 3 Flash, DeepSeek-V3.2, Grok 4.1) across mathematical benchmarks (AIME, GSM8K, HMMT). Our findings reveal: (1) truncated reasoning can hurt, as DeepSeek-V3.2 achieves 53% with no reasoning but only 17% with truncated CoT at 50% budget; (2) code degrades gracefully, as Gemini's comments collapse to 0% while code maintains 43-47%; (3) hybrid reasoning underperforms single modalities; (4) robustness is model-dependent, as Grok maintains 80-90% at 30% budget where OpenAI and DeepSeek collapse to 7-27%. These results suggest incomplete reasoning chains actively mislead models, with implications for deploying reasoning-specialized systems under resource constraints.
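To make the budget ablation concrete, here is a minimal Python sketch of how the 10%, 30%, 50%, and 70% budgets might be derived from an "optimal" token count and applied to a reasoning trace. The abstract does not say how the optimal budget is measured or how truncation is enforced, so the function names (`ablated_budgets`, `truncate_trace`), the floor rounding, and the use of whitespace-split words as stand-ins for model tokens are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List

# Budget levels named in the abstract, as integer percentages of "optimal".
BUDGET_PERCENTS = (10, 30, 50, 70)


def ablated_budgets(optimal_tokens: int) -> Dict[int, int]:
    """Map each percentage to a token budget (hypothetical helper).

    Uses integer arithmetic and rounds down, keeping a minimum of 1 token.
    """
    return {p: max(1, optimal_tokens * p // 100) for p in BUDGET_PERCENTS}


def truncate_trace(tokens: List[str], budget: int) -> List[str]:
    """Keep only the first `budget` tokens of a reasoning trace.

    Assumption: truncation is a hard prefix cut; the paper may instead cap
    generation length through the model API.
    """
    return tokens[:budget]


if __name__ == "__main__":
    budgets = ablated_budgets(1000)
    print(budgets)

    # Whitespace-split words stand in for real tokenizer output.
    trace = "let x be 3 then x squared is 9 so the answer is 9".split()
    print(truncate_trace(trace, 5))
```

A hard prefix cut like this is the simplest way to produce an incomplete chain, which is the condition the abstract associates with accuracy falling below the no-reasoning baseline.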

