LG · Mar 16

Massive Redundancy in Gradient Transport Enables Sparse Online Learning

arXiv:2603.15195 · 3.61 citations

Predicted impact top 97% in LG · last 90 days · Originality: Incremental advance

AI Analysis

This work addresses the computational bottleneck in online learning for recurrent neural networks, offering a scalable solution with broad implications for real-time AI applications, though it is incremental in building on prior approximations.

The paper tackles the high computational cost of real-time recurrent learning (RTRL) by showing that the recurrent Jacobian is massively redundant. Propagating through only a small fraction of paths (e.g., 6% for n=64) recovers 84% of full RTRL's adaptation ability. It also reports improved numerical stability on chaotic dynamics and extends the result to other architectures, including LSTMs and transformers.

Real-time recurrent learning (RTRL) computes exact online gradients by propagating a Jacobian tensor forward through recurrent dynamics, but at O(n^4) cost per step. Prior work has sought structured approximations (rank-1 compression, graph-based sparsity, Kronecker factorization). We show that, in the continuous error signal regime, the recurrent Jacobian is massively redundant: propagating through a random 6% of paths (k=4 of n=64) recovers 84 +/- 6% of full RTRL's adaptation ability across five seeds, and the absolute count k=4 remains effective from n=64 to n=256 (6% to 1.6%, recovery 84 to 78%), meaning sparse RTRL becomes relatively cheaper as networks grow. In RNNs, the recovery is selection-invariant (even adversarial path selection works) and exhibits a step-function transition from zero to any nonzero propagation. Spectral analysis reveals the mechanism: the Jacobian is full-rank but near-isotropic (condition numbers 2.6-6.5), so any random subset provides a directionally representative gradient estimate. On chaotic dynamics (Lorenz attractor), sparse propagation is more numerically stable than full RTRL (CV 13% vs. 88%), as subsampling avoids amplifying pathological spectral modes. The redundancy extends to LSTMs (k=4 matches full RTRL) and to transformers via sparse gradient transport (50% head sparsity outperforms the dense reference; 33% is borderline), with higher thresholds reflecting head specialization rather than isotropy. On real primate neural data, sparse RTRL (k=4) adapts online to cross-session electrode drift (80 +/- 11% recovery, 5 seeds), where sparse propagation is again more stable than full RTRL. Without continuous error signal, Jacobian propagation accumulates numerical drift and degrades all RTRL variants, a scope condition for all forward-mode methods. Results hold with SGD (92 +/- 1% recovery), suggesting independence from optimizer choice.
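To make the idea of propagating through k of n paths concrete, here is a minimal NumPy sketch of one sparse RTRL step for a vanilla tanh RNN. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the architecture details or the path-selection scheme, so the update rule h = tanh(W h_prev + U x), the choice of k random incoming paths per hidden unit, and the names `sample_paths` and `sparse_rtrl_step` are all hypothetical. Only the sensitivity with respect to W is tracked, to keep the sketch short.

```python
import numpy as np


def sample_paths(n, k, rng):
    """Pick k distinct incoming units per hidden unit (assumed selection scheme).

    Returns an integer array of shape (n, k); row i lists the units j whose
    sensitivities are transported into unit i.
    """
    return np.stack([rng.choice(n, size=k, replace=False) for _ in range(n)])


def sparse_rtrl_step(W, U, h_prev, x, P_prev, paths):
    """One forward step plus a sparse update of the sensitivity tensor.

    P[i, a, b] tracks d h[i] / d W[a, b]. With k == n (every path kept) this
    reduces to the exact RTRL recursion for this simple RNN.
    """
    n = W.shape[0]
    h = np.tanh(W @ h_prev + U @ x)
    d = 1.0 - h ** 2                       # derivative of tanh at the new state

    rows = np.arange(n)[:, None]
    W_kept = W[rows, paths]                # (n, k) recurrent weights on kept paths
    # Transport term: sum over the k kept units j of W[i, j] * P_prev[j, a, b].
    # This costs O(k * n^3) per step instead of O(n^4) for the full sum.
    P = np.einsum("ik,ikab->iab", W_kept, P_prev[paths])

    # Immediate term: d(pre-activation[i]) / d W[a, b] = delta(i, a) * h_prev[b].
    idx = np.arange(n)
    P[idx, idx, :] += h_prev

    P *= d[:, None, None]                  # chain rule through tanh
    return h, P


# Example usage with the sizes quoted in the abstract (k=4 of n=64).
rng = np.random.default_rng(0)
n, m, k = 64, 8, 4
W = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))
U = rng.normal(scale=1.0 / np.sqrt(m), size=(n, m))
paths = sample_paths(n, k, rng)

h = np.zeros(n)
P = np.zeros((n, n, n))
for _ in range(3):
    h, P = sparse_rtrl_step(W, U, h, rng.normal(size=m), P, paths)

# Given an instantaneous error signal e = dL/dh, the online gradient estimate is:
e = rng.normal(size=n)
grad_W = np.einsum("i,iab->ab", e, P)      # shape (n, n)
```

Under this per-unit reading, the kept fraction is k/n: 4/64 = 6.25% and 4/256 ≈ 1.56%, which match the rounded 6% and 1.6% quoted above. The hidden state itself is always computed with the full W; only the transport of past sensitivities is subsampled.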
