LGMay 7
Echo: KV-Cache-Free Associative Recall with Spectral Koopman OperatorsAnupama Sridhar, Alexander Johansen
Long chain-of-thought reasoning and agentic tool-calling produce traces spanning tens of thousands of tokens, yet Transformer KV caches grow linearly with sequence length, creating a memory bottleneck on commodity hardware. State-space models offer constant-memory recurrence but suffer a memory cliff: retrieval accuracy collapses once the gap between a stored fact and its query exceeds the effective horizon of the recurrent state. We introduce Echo, a KV-cache-free associative recall architecture built around Spectral Koopman Attention (SKA); a drop-in replacement for attention layers that augments SSM blocks with a closed-form dynamical operator whose sufficient statistics are accumulated in constant memory with no KV cache. Echo fits a spectral linear system to the key and value history via kernel ridge regression and retrieves through a learned power-iterated filter, all from $O(r^{2})$ streaming state where $r$ is a small projection rank. On the Multi-Query Associative Recall benchmark, a pure Mamba-2 SSM fails to exceed chance accuracy (${\sim}3\%$) across all gap lengths and KV-pair counts, while at the 50M parameter scale SKA-augmented models achieve $100\%$ retrieval accuracy on every configuration tested, including distractor gaps of $4{,}096$ tokens with $32$ KV pairs. Across five additional transfer benchmarks including needle-in-a-haystack, tool-trace, and multi-hop retrieval, SKA consistently outperforms both pure SSM and SSM+Attention hybrids while maintaining constant inference memory. Ablations confirm that the spectral operator, not the prefix masking strategy, drives the retrieval gain.
MLMay 21, 2025
Convergence of Adam in Deep ReLU Networks via Directional Complexity and Kakeya BoundsAnupama Sridhar, Alexander Johansen
First-order adaptive optimization methods like Adam are the default choices for training modern deep neural networks. Despite their empirical success, the theoretical understanding of these methods in non-smooth settings, particularly in Deep ReLU networks, remains limited. ReLU activations create exponentially many region boundaries where standard smoothness assumptions break down. \textbf{We derive the first \(\tilde{O}\!\bigl(\sqrt{d_{\mathrm{eff}}/n}\bigr)\) generalization bound for Adam in Deep ReLU networks and the first global-optimal convergence for Adam in the non smooth, non convex relu landscape without a global PL or convexity assumption.} Our analysis is based on stratified Morse theory and novel results in Kakeya sets. We develop a multi-layer refinement framework that progressively tightens bounds on region crossings. We prove that the number of region crossings collapses from exponential to near-linear in the effective dimension. Using a Kakeya based method, we give a tighter generalization bound than PAC-Bayes approaches and showcase convergence using a mild uniform low barrier assumption.
MLFeb 8, 2025
Convergence of TD(0) under Polynomial Mixing with Nonlinear Function ApproximationAnupama Sridhar, Alexander Johansen
Temporal Difference Learning (TD(0)) is fundamental in reinforcement learning, yet its finite-sample behavior under non-i.i.d. data and nonlinear approximation remains unknown. We provide the first high-probability, finite-sample analysis of vanilla TD(0) on polynomially mixing Markov data, assuming only Holder continuity and bounded generalized gradients. This breaks with previous work, which often requires subsampling, projections, or instance-dependent step-sizes. Concretely, for mixing exponent $β> 1$, Holder continuity exponent $γ$, and step-size decay rate $η\in (1/2, 1]$, we show that, with high probability, \[ \| θ_t - θ^* \| \leq C(β, γ, η)\, t^{-β/2} + C'(γ, η)\, t^{-ηγ} \] after $t = \mathcal{O}(1/\varepsilon^2)$ iterations. These bounds match the known i.i.d. rates and hold even when initialization is nonstationary. Central to our proof is a novel discrete-time coupling that bypasses geometric ergodicity, yielding the first such guarantee for nonlinear TD(0) under realistic mixing.