MathBode: Understanding LLM Reasoning with Dynamical Systems
MathBode gives researchers and practitioners a new diagnostic tool for understanding and comparing LLM reasoning fidelity and consistency, though the contribution is incremental because it builds on existing benchmarking methods.
The paper addresses the problem of diagnosing mathematical reasoning in large language models by introducing MathBode, a dynamic diagnostic built on frequency-resolved metrics (gain and phase). Across five problem families, it reveals systematic low-pass behavior and growing phase lag that accuracy alone misses.
This paper presents MathBode, a dynamic diagnostic for mathematical reasoning in large language models (LLMs). Instead of one-shot accuracy, MathBode treats each parametric problem as a system: we drive a single parameter sinusoidally and fit first-harmonic responses of model outputs and exact solutions. This yields interpretable, frequency-resolved metrics -- gain (amplitude tracking) and phase (lag) -- that form Bode-style fingerprints. Across five closed-form families (linear solve, ratio/saturation, compound interest, 2x2 linear systems, similar triangles), the diagnostic surfaces systematic low-pass behavior and growing phase lag that accuracy alone obscures. We compare several models against a symbolic baseline that calibrates the instrument ($G \approx 1$, $\phi \approx 0$). Results separate frontier from mid-tier models on dynamics, providing a compact, reproducible protocol that complements standard benchmarks with actionable measurements of reasoning fidelity and consistency. We open-source the dataset and code to enable further research and adoption.
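The first-harmonic fit described in the abstract can be illustrated with a short sketch. This is not the released MathBode code. It assumes a least-squares fit of a constant plus one sine/cosine pair at the drive frequency, a uniform sampling grid, and a sign convention in which negative phase means lag. The function names (`first_harmonic`, `gain_and_phase`), the linear-solve instance, and the simulated "model" response are all hypothetical and chosen only for illustration.

```python
import numpy as np

def first_harmonic(y, omega, t):
    """Least-squares fit y ~ c + a*sin(omega*t) + b*cos(omega*t).

    Returns (amplitude, phase) such that the fitted harmonic equals
    amplitude * sin(omega*t + phase).
    """
    X = np.column_stack([np.ones_like(t), np.sin(omega * t), np.cos(omega * t)])
    _, a, b = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.hypot(a, b), np.arctan2(b, a)

def gain_and_phase(y_model, y_true, omega, t):
    """Gain and phase of the model response relative to the exact solution."""
    amp_m, ph_m = first_harmonic(y_model, omega, t)
    amp_t, ph_t = first_harmonic(y_true, omega, t)
    gain = amp_m / amp_t
    # Wrap the phase difference into [-pi, pi); negative values mean lag (assumed convention).
    phase = (ph_m - ph_t + np.pi) % (2 * np.pi) - np.pi
    return gain, phase

# Hypothetical linear-solve instance: a*x + b = c, with c driven sinusoidally.
t = np.arange(64, dtype=float)          # sample index (assumed grid)
omega = 2 * np.pi * 4 / 64              # 4 full cycles over the sweep
a_coef, b_coef = 2.0, 1.0
c = 5.0 + 0.5 * np.sin(omega * t)       # driven parameter
y_true = (c - b_coef) / a_coef          # exact solution x(t)

# Simulated (not measured) model output: attenuated by 0.8 and lagging by 0.3 rad.
y_model = 2.0 + 0.8 * 0.25 * np.sin(omega * t - 0.3)

print(gain_and_phase(y_model, y_true, omega, t))   # simulated model
print(gain_and_phase(y_true, y_true, omega, t))    # symbolic baseline check
```

Because the simulated response lies exactly in the span of the fitted basis, the first call recovers the injected attenuation and lag up to floating-point error, and the second call reproduces the calibration condition $G \approx 1$, $\phi \approx 0$ that the symbolic baseline is meant to satisfy. Repeating the fit over a range of `omega` values would trace out a Bode-style curve for one problem family.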