Mixture of Chapters: Scaling Learnt Memory in Transformers

Tasmay Pankaj Tibrewal, Pritish Saha, Ankit Meda, Kunal Singh, Pradeep Moturi

arXiv:2603.21096 | 44.9 | h-index: 2

AI Analysis

This work addresses the need for scalable explicit memory in Transformers, offering improved knowledge retention and reduced forgetting across training phase transitions. It is, however, an incremental advance that builds on Mixture-of-Experts ideas.

The paper tackles the problem of Transformers lacking explicit memory for storing knowledge by introducing learnable sparse memory banks with chapter-based routing, enabling scaling to 262K memory tokens and surpassing iso-FLOP baselines in pre-training and instruction fine-tuning benchmarks.

Transformers lack an explicit architectural mechanism for storing and organizing knowledge acquired during training. We introduce learnable sparse memory banks: a set of latent tokens, randomly initialized and trained end-to-end, that transformer layers query via cross-attention to retrieve stored knowledge. To scale memory capacity without prohibitive attention costs, we propose chapter-based routing inspired by Mixture-of-Experts architectures, partitioning the memory bank into chapters and training a router to select relevant subsets per input. This enables scaling to 262K memory tokens while maintaining tractable computation. We evaluate our approach against standard transformers (in iso-FLOP settings) on pre-training and instruction fine-tuning across relevant benchmarks. Our models surpass iso-FLOP baselines, suggesting scope for a new axis of scaling and demonstrating that explicit associative memory provides capacity complementary to what is captured implicitly in model parameters. Additionally, we observe improved knowledge retention under continued training, with robustness to forgetting when transitioning between training phases (e.g., pre-training to instruction fine-tuning).
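
The mechanism described in the abstract (partition a memory bank into chapters, route each input to a few chapters, then cross-attend only to the selected memory tokens) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the dot-product router with top-k selection, the toy sizes, the single-query and single-head setup, and the reuse of memory tokens as both keys and values. The names `route_chapters` and `memory_cross_attention` are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def route_chapters(query, chapter_keys, top_k):
    """Score every chapter against the query and keep the top_k chapter indices.

    Assumption: a dot-product router with one key vector per chapter.
    """
    scores = [dot(query, key) for key in chapter_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


def memory_cross_attention(query, memory, chapter_ids, chapter_size):
    """Cross-attend from one query vector to the selected chapters only.

    Assumption: memory tokens serve as both keys and values (no projections).
    Returns the attended output and the number of memory tokens touched.
    """
    selected = []
    for c in chapter_ids:
        selected.extend(memory[c * chapter_size:(c + 1) * chapter_size])
    dim = len(query)
    weights = softmax([dot(query, tok) / math.sqrt(dim) for tok in selected])
    output = [sum(w * tok[j] for w, tok in zip(weights, selected)) for j in range(dim)]
    return output, len(selected)


# Toy sizes, chosen for illustration only.
DIM, NUM_CHAPTERS, CHAPTER_SIZE, TOP_K = 8, 16, 4, 2
rng = random.Random(0)

# Randomly initialized memory bank; in training these would be learned parameters.
memory = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_CHAPTERS * CHAPTER_SIZE)]
chapter_keys = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_CHAPTERS)]
query = [rng.gauss(0, 1) for _ in range(DIM)]

chapters = route_chapters(query, chapter_keys, TOP_K)
output, attended = memory_cross_attention(query, memory, chapters, CHAPTER_SIZE)
print(f"attended to {attended} of {len(memory)} memory tokens")
```

In this sketch the attention cost per query grows with `TOP_K * CHAPTER_SIZE` rather than with the full bank size, which is the general reason routing can keep computation tractable as the bank grows. How the paper's router is actually parameterized and trained is not specified in the abstract.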
