Sample-Efficient Language Modeling with Linear Attention and Lightweight Enhancements
This work addresses efficient language modeling in low-resource settings. Its contribution is incremental, since it builds on existing linear attention and optimization techniques.
The authors tackle sample-efficient language modeling under resource constraints with BLaLM, a model that replaces self-attention with a linear-time mLSTM token mixer and adds lightweight enhancements such as sliding window attention. In the BabyLM 2025 shared task setting, combining linear attention with sliding window attention improved zero-shot performance, and the Muon optimizer stabilized convergence and reduced perplexity relative to AdamW.
We study architectural and optimization techniques for sample-efficient language modeling under the constraints of the BabyLM 2025 shared task. Our model, BLaLM, replaces self-attention with a linear-time mLSTM token mixer and explores lightweight enhancements, including short convolutions, sliding window attention with dynamic modulation, and Hedgehog feature maps. To support training in low-resource settings, we curate a high-quality corpus emphasizing readability and pedagogical structure. Experiments across both STRICT and STRICT-SMALL tracks show that (1) linear attention combined with sliding window attention consistently improves zero-shot performance, and (2) the Muon optimizer stabilizes convergence and reduces perplexity over AdamW. These results highlight effective strategies for efficient language modeling without relying on scale.
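To make the "linear-time" claim concrete, the following is a minimal sketch of generic causal linear attention, not the BLaLM implementation. It assumes a simple elu(x) + 1 feature map as a stand-in for the Hedgehog feature maps, and it omits the gating of the mLSTM mixer named in the abstract. The function names are hypothetical. The recurrent form keeps a fixed-size running state, so cost grows linearly with sequence length, and it should match the explicit quadratic form up to floating-point error.

```python
import math


def feature_map(x):
    # elu(x) + 1, a positive feature map. This is an assumption made for
    # illustration; the abstract's Hedgehog feature maps are not reproduced.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention_recurrent(queries, keys, values):
    """Causal linear attention in O(T) time using a fixed-size running state."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    norm = [0.0] * d_k                         # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_q, phi_k = feature_map(q), feature_map(k)
        # Fold the current token into the state before reading from it,
        # so position t attends to positions 0..t inclusive.
        for i in range(d_k):
            norm[i] += phi_k[i]
            for j in range(d_v):
                state[i][j] += phi_k[i] * v[j]
        denom = sum(phi_q[i] * norm[i] for i in range(d_k))
        outputs.append(
            [sum(phi_q[i] * state[i][j] for i in range(d_k)) / denom
             for j in range(d_v)]
        )
    return outputs


def linear_attention_quadratic(queries, keys, values):
    """Reference O(T^2) form with explicit causal weights phi(q_t) . phi(k_s)."""
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        phi_q = feature_map(q)
        weights = [
            sum(a * b for a, b in zip(phi_q, feature_map(keys[s])))
            for s in range(t + 1)
        ]
        total = sum(weights)
        outputs.append(
            [sum(w * values[s][j] for s, w in enumerate(weights)) / total
             for j in range(d_v)]
        )
    return outputs
```

Because the state has a fixed size of d_k by d_v, memory does not grow with context length. That fixed-size state is the usual reason for pairing linear attention with a local mechanism such as sliding window attention, which attends exactly over recent tokens.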