When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning

Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary

arXiv:2603.03475v12.72 citationsh-index: 1

Originality Incremental advance

AI Analysis

This exposes hidden unreliability in models used for education and decision support, calling for evaluation reforms, though it is incremental in analyzing existing models.

The paper tackled the problem of computational instabilities in mathematical reasoning models, revealing that state-of-the-art models achieve 61% accuracy but with 81.6% of correct predictions from unreliable pathways and 8.8% silent failures, while scaling parameters provided no accuracy benefit on a subset.

Mathematical reasoning models are widely deployed in education, automated tutoring, and decision support systems despite exhibiting fundamental computational instabilities. We demonstrate that state-of-the-art models (Qwen2.5-Math-7B) achieve 61% accuracy through a mixture of reliable and unreliable reasoning pathways: 18.4% of correct predictions employ stable, faithful reasoning while 81.6% emerge through computationally inconsistent pathways. Additionally, 8.8% of all predictions are silent failures -- confident yet incorrect outputs. Through comprehensive analysis using novel faithfulness metrics, we reveal: (1) reasoning quality shows weak negative correlation with correctness (r=-0.21, p=0.002), reflecting a binary classification threshold artifact rather than a monotonic inverse relationship; (2) scaling from 1.5B to 7B parameters (4.7x increase) provides zero accuracy benefit on our evaluated subset (6% of GSM8K), requiring validation on the complete benchmark; and (3) latent reasoning employs diverse computational strategies, with ~20% sharing CoT-like patterns. These findings highlight that benchmark accuracy can mask computational unreliability, demanding evaluation reforms measuring stability beyond single-sample metrics.

View on arXiv PDF

Similar