Scalable MatMul-free Language Modeling
This work addresses scalability issues for AI practitioners by reducing compute and memory demands, though the contribution is incremental because it builds on existing LLM architectures.
The paper tackles the computational and memory costs of large language models by eliminating matrix multiplication operations. It reports performance comparable to state-of-the-art Transformers, with up to 61% lower memory use during training and over 10x lower memory use during inference. On a multi-chip neuromorphic system, it reports 4x higher throughput with 10x less energy than edge GPUs.
Large Language Models (LLMs) have fundamentally altered how we approach scaling in machine learning. However, these models pose substantial computational and memory challenges, primarily due to their reliance on matrix multiplication (MatMul) within their attention and feed-forward (FFN) layers. We demonstrate that MatMul operations can be eliminated from LLMs while maintaining strong performance, even at billion-parameter scales. Our MatMul-free models, tested at scales up to 2.7B parameters, are comparable to state-of-the-art pre-trained Transformers, and the performance gap narrows as model size increases. Our approach yields significant memory savings: a GPU-efficient implementation reduces memory consumption by up to 61% during training and by over 10x during inference. When adapted for a multi-chip neuromorphic system, the model leverages asynchronous processing to achieve 4x higher throughput with 10x less energy than edge GPUs.
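The abstract does not say how MatMul is removed from a dense layer. One mechanism, assumed here purely for illustration, is to constrain weights to the ternary set {-1, 0, +1}, so that each output becomes a running sum of added or subtracted inputs followed by a single scalar rescale. The minimal sketch below uses plain Python lists; the function names and the absmean quantization rule are assumptions for this example, not the paper's implementation, and it covers only the dense-layer case, not attention.

```python
def ternary_quantize(weights):
    """Map real-valued weights to {-1, 0, +1}.

    Assumption: scale by the mean absolute weight, round, then clip.
    Returns the ternary matrix and the scale used.
    """
    flat = [abs(w) for row in weights for w in row]
    scale = (sum(flat) / len(flat)) or 1.0  # avoid dividing by zero
    ternary = [[max(-1, min(1, round(w / scale))) for w in row]
               for row in weights]
    return ternary, scale


def ternary_linear(x, ternary, scale):
    """Compute y = x @ (scale * T) without per-weight multiplications.

    Each ternary weight selects add, subtract, or skip; the only
    multiply is one rescale per output element.
    """
    n_out = len(ternary[0])
    out = []
    for j in range(n_out):
        acc = 0.0
        for i, xi in enumerate(x):
            t = ternary[i][j]
            if t == 1:
                acc += xi
            elif t == -1:
                acc -= xi
            # t == 0 contributes nothing
        out.append(acc * scale)
    return out


def dense_linear(x, weights):
    """Reference dense layer using ordinary multiply-accumulate."""
    return [sum(x[i] * weights[i][j] for i in range(len(x)))
            for j in range(len(weights[0]))]
```

Calling `ternary_linear(x, *ternary_quantize(w))` gives the same result as `dense_linear` applied to the rescaled ternary weights, up to floating-point rounding. It only approximates the layer defined by the original real-valued `w`, which is why such models are typically trained with the ternary constraint in place rather than quantized afterwards.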