LGMar 5

BASIS: Balanced Activation Sketching with Invariant Scalars for "Ghost Backpropagation"

arXiv:2604.16324h-index: 1Has Code

Originality Highly original

AI Analysis

This addresses a critical scaling limitation for deep neural networks, particularly in large models like GPT, with incremental improvements in efficiency and stability.

The paper tackles the activation memory bottleneck in backpropagation that scales with network depth, context length, and feature dimensions, introducing BASIS to reduce memory usage to O(L * RN) while achieving comparable validation loss (6.575 vs. 6.616) to exact backpropagation in GPT training.

The activation memory required for exact backpropagation scales linearly with network depth, context length, and feature dimensionality, forming an O(L * BN ) spatial bottleneck (where B is the sequence-batch cardinality and N is the feature dimension). This constraint historically throttles the scaling of deep neural networks. While randomized automatic differentiation attempts to mitigate this, it historically suffers from catastrophic variance. In this paper, we introduce BASIS (Balanced Activation Sketching with Invariant Scalars), an efficient backpropagation algorithm that fully decouples activation memory from the batch and sequence dimensions. BASIS propagates the exact error signal (dX) to preserve flawless gradient flow, but computes the weight updates (dW) using massively compressed rank-R tensors. To solve the foundational instability of sketched gradients, we propose two novel mechanisms: Balanced Hashing, which strictly eliminates off-diagonal collision variance, and Invariant Scalars, a principled bias-variance tradeoff that deterministically preserves the exact continuous energy norm of the spatial geometry. Theoretically, BASIS reduces activation memory to O(L * RN ) and heavily decreases the backward pass matrix-multiplication footprint. Empirically, training a GPT architecture for 50,000 steps validates our theoretical guarantees: at R = 32, BASIS achieves parity with (and marginally outperforms) exact backpropagation validation loss (6.575 vs. 6.616), acting as an implicit regularizer. Remarkably, the stabilized magnitude trajectory allows the model to converge smoothly even under extreme spatial compression (R = 1), proving the extreme robustness of the estimator. The code is available at https://github.com/VladimerKhasia/basis

View on arXiv PDF Code

Similar