LGOct 18, 2023

Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture

Daniel Y. Fu, Simran Arora, Jessica Grogan, Isys Johnson, Sabri Eyuboglu, Armin W. Thomas, Benjamin Spector, Michael Poli, Atri Rudra, Christopher Ré

arXiv:2310.12109v129.377 citationsh-index: 35

Originality Highly original

AI Analysis

This addresses the scalability bottleneck in machine learning models for tasks like language modeling and image classification, offering a potentially more efficient alternative to Transformers.

The paper tackles the problem of quadratic scaling in sequence length and model dimension in architectures like Transformers by introducing Monarch Mixer (M2), a sub-quadratic architecture using Monarch matrices, which matches BERT-base and BERT-large in GLUE quality with up to 27% fewer parameters and achieves up to 9.1x higher throughput at sequence length 4K, outperforms ViT-b by 1% accuracy on ImageNet with half the parameters, and matches GPT-style Transformers at 360M parameters in pretraining perplexity on The PILE.

Machine learning models are increasingly being scaled in both sequence length and model dimension to reach longer contexts and better performance. However, existing architectures such as Transformers scale quadratically along both these axes. We ask: are there performant architectures that can scale sub-quadratically along sequence length and model dimension? We introduce Monarch Mixer (M2), a new architecture that uses the same sub-quadratic primitive along both sequence length and model dimension: Monarch matrices, a simple class of expressive structured matrices that captures many linear transforms, achieves high hardware efficiency on GPUs, and scales sub-quadratically. As a proof of concept, we explore the performance of M2 in three domains: non-causal BERT-style language modeling, ViT-style image classification, and causal GPT-style language modeling. For non-causal BERT-style modeling, M2 matches BERT-base and BERT-large in downstream GLUE quality with up to 27% fewer parameters, and achieves up to 9.1$\times$ higher throughput at sequence length 4K. On ImageNet, M2 outperforms ViT-b by 1% in accuracy, with only half the parameters. Causal GPT-style models introduce a technical challenge: enforcing causality via masking introduces a quadratic bottleneck. To alleviate this bottleneck, we develop a novel theoretical view of Monarch matrices based on multivariate polynomial evaluation and interpolation, which lets us parameterize M2 to be causal while remaining sub-quadratic. Using this parameterization, M2 matches GPT-style Transformers at 360M parameters in pretraining perplexity on The PILE--showing for the first time that it may be possible to match Transformer quality without attention or MLPs.

View on arXiv PDF

Similar