LG | AI | CL | Apr 9, 2025

Adaptive Computation Pruning for the Forgetting Transformer

MILA
arXiv:2504.06949v2 | 3 citations | h-index: 11 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the high computational cost of attention mechanisms in large language models, offering significant speedups for training, though it is incremental as it builds on the existing FoX architecture.

The paper tackles the computational inefficiency of the Forgetting Transformer (FoX) by proposing Adaptive Computation Pruning (ACP), which dynamically prunes attention computations that the forget gate has decayed to negligible weight. ACP cuts softmax-attention FLOPs and memory accesses by around 70%, which yields a 2-3x speedup in attention runtime and a roughly 10% to 40% increase in end-to-end training throughput, with no performance degradation.

The recently proposed Forgetting Transformer (FoX) incorporates a forget gate into softmax attention and has shown consistently better or on-par performance compared to the standard RoPE-based Transformer. Notably, many attention heads in FoX tend to forget quickly, causing their output at each timestep to rely primarily on local context. Based on this observation, we propose Adaptive Computation Pruning (ACP) for FoX, a method that dynamically prunes computations involving input-output dependencies that are strongly decayed by the forget gate. In particular, our method performs provably safe pruning via a dynamically set pruning threshold that guarantees the pruned attention weights are negligible. We apply ACP to language model pretraining with FoX and show it consistently reduces the number of FLOPs and memory accesses in softmax attention by around 70% across different model sizes and context lengths, resulting in a roughly 50% to 70% reduction in attention runtime (or a 2-3$\times$ speedup) and a roughly 10% to 40% increase in end-to-end training throughput. Furthermore, longer context lengths yield greater computational savings. All these speed improvements are achieved without any performance degradation. Our code is available at https://github.com/zhixuan-lin/forgetting-transformer.
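
The pruning idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation (see the linked repository for that). It assumes a single head, attention logits of the form q_i·k_j/sqrt(d) + D_ij with D_ij the sum of log forget-gate values between positions j+1 and i, and a threshold derived here from a simple norm bound U on |q_i·k_j|/sqrt(d): any entry with D_ij < -2U - log L + log eps has attention weight below eps/L, so the total pruned weight per row stays below eps. The function name `fox_attention`, the parameter `eps`, and this particular bound are illustrative assumptions. The sketch still computes the dense matrix and only masks it, so it shows why pruning is safe, not the speedup.

```python
import numpy as np

def fox_attention(q, k, v, log_f, eps=None):
    """Single-head causal attention with a forget-gate bias (illustrative).

    q, k, v: arrays of shape (L, d); log_f: shape (L,), log forget gates (<= 0).
    If eps is given, entries whose decay bias falls below a safe threshold
    are masked out. Returns (output, keep_mask).
    """
    L, d = q.shape
    c = np.cumsum(log_f)
    # D[i, j] = sum of log_f[j+1..i]; only the causal part (j <= i) is used.
    D = c[:, None] - c[None, :]
    keep = np.tril(np.ones((L, L), dtype=bool))
    logits = q @ k.T / np.sqrt(d) + D

    if eps is not None:
        # Assumed bound: |q_i . k_j| / sqrt(d) <= U for all i, j.
        U = (np.linalg.norm(q, axis=1).max()
             * np.linalg.norm(k, axis=1).max() / np.sqrt(d))
        # Weight of (i, j) <= exp(2U + D_ij), so D_ij < delta implies
        # weight < eps / L and total pruned weight per row < eps.
        delta = -2.0 * U - np.log(L) + np.log(eps)
        keep &= D >= delta  # the diagonal (D = 0) is always kept

    logits = np.where(keep, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return w @ v, keep

rng = np.random.default_rng(0)
L, d, eps = 64, 16, 1e-6
q, k, v = (rng.standard_normal((L, d)) for _ in range(3))
log_f = np.full(L, -2.0)  # a head that forgets quickly (hypothetical values)

out_full, keep_full = fox_attention(q, k, v, log_f)
out_pruned, keep_pruned = fox_attention(q, k, v, log_f, eps=eps)
print("kept fraction:", keep_pruned.sum() / keep_full.sum())
print("max abs diff:", np.abs(out_full - out_pruned).max())
```

Under these assumptions the pruned output differs from the full output by at most 2·eps·max|v| per element, while a large share of the causal entries is skipped for a fast-forgetting head. A real kernel would skip whole blocks of the attention matrix instead of masking individual entries.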

