EMA Is Not All You Need: Mapping the Boundary Between Structure and Content in Recurrent Context
This work maps what fixed-coefficient recurrent context can and cannot represent, and identifies irreversible information loss in fixed-coefficient accumulation as the key bottleneck.
The paper uses exponential moving average (EMA) traces to probe the limits of fixed-coefficient context in sequence modeling. With zero labels, a Hebbian architecture with multi-timescale EMA traces reaches 96% of a supervised BiGRU's performance on grammatical role assignment. In language modeling, however, a 130M-parameter model using only EMA context reaches a C4 perplexity of 260, about 8x that of GPT-2, which points to the traces' inability to preserve token identity.
What exactly do efficient sequence models gain over simple temporal averaging? We use exponential moving average (EMA) traces, the simplest recurrent context (no gating, no content-based retrieval), as a controlled probe to map the boundary between what fixed-coefficient accumulation can and cannot represent. EMA traces encode temporal structure: a Hebbian architecture with multi-timescale traces achieves 96% of a supervised BiGRU on grammatical role assignment with zero labels, surpassing the supervised model on structure-dependent roles. EMA traces destroy token identity: a 130M-parameter language model using only EMA context reaches C4 perplexity 260 (8x GPT-2), and a predictor ablation (replacing the linear predictor with full softmax attention) yields identical loss, localizing the entire gap to the traces. The traces apply lossy, data-independent compression; by the data processing inequality, no downstream predictor can recover the discarded information. Fixed-coefficient accumulation, whether across time or depth, suffers irreversible information dilution that only learned, input-dependent selection can resolve.
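To make "fixed-coefficient accumulation" concrete, the following is a minimal NumPy sketch of multi-timescale EMA traces. It assumes the common update h_t = alpha * h_{t-1} + (1 - alpha) * x_t with a zero initial state; the abstract does not give the exact recurrence, so this form, the function names `ema_traces` and `token_weight`, and the decay values used in the example are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np


def ema_traces(x, alphas):
    """Multi-timescale EMA traces over a sequence (illustrative sketch).

    x:      array of shape (T, d), one embedding per time step.
    alphas: decay coefficients in (0, 1), one per timescale (assumed values).
    Returns an array of shape (T, K, d), where K = len(alphas).

    Assumed update: h_t = alpha * h_{t-1} + (1 - alpha) * x_t, with h_0 = 0.
    The coefficients never depend on the input, so there is no gating and
    no content-based retrieval.
    """
    x = np.asarray(x, dtype=float)
    a = np.asarray(alphas, dtype=float)[:, None]  # shape (K, 1) for broadcasting
    T, d = x.shape
    h = np.zeros((a.shape[0], d))
    out = np.empty((T, a.shape[0], d))
    for t in range(T):
        h = a * h + (1.0 - a) * x[t]  # same fixed coefficients at every step
        out[t] = h
    return out


def token_weight(alpha, k):
    """Weight that a token seen k steps ago carries in the current trace."""
    return (1.0 - alpha) * alpha ** k
```

Under this assumed recurrence, a token's contribution decays geometrically with its age, whatever the token is: for example, `token_weight(0.9, 20)` is already below 0.013. Because the trace is a fixed weighted sum of past embeddings, many different histories map to nearly the same trace, which is consistent with the abstract's point that no downstream predictor can recover the discarded token identity.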