Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models

arXiv:2605.06683


AI Analysis

For researchers working on efficient sequence models, TMM offers a low-complexity alternative to transformers with improved information retention and competitive performance.

The authors introduce the Toeplitz MLP Mixer (TMM), a transformer-like architecture that replaces attention with triangular-masked Toeplitz matrix multiplication over the sequence dimension, achieving O(d n log n) time and O(d n) space complexity during training and O(d n) time and space at inference prefill. Compared to similar architectures, TMMs show greater training efficiency, better input information retention, and higher accuracy on information retrieval and in-context learning benchmarks.

Transformer-based large language models are in some respects limited by the quadratic time and space complexity of attention. We introduce the Toeplitz MLP Mixer (TMM), a transformer-like architecture that swaps attention for triangular-masked Toeplitz matrix multiplication over the sequence dimension, resulting in $\mathcal{O}(dn \log n)$ time and $\mathcal{O}(dn)$ space complexity during training and $\mathcal{O}(dn)$ time and space at inference prefill. Despite lacking the sophisticated input modulation or state maintenance present in other sub-quadratic architectures, TMMs yield greater training efficiency in terms of loss achieved per unit of compute and device memory. We demonstrate that TMMs retain more input information, resulting in improved copying ability, which we argue results from a lack of architectural biases. Consistent with higher input information retention, TMMs exhibit superior information retrieval and in-context learning benchmark accuracy compared to comparable architectures. We conclude with an analysis from the perspective of operator index theory and show that, counterintuitively, trained Toeplitz layers of causal non-invertible models are more likely to be invertible or nearly so than those of models that are actually invertible over their inputs.
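The $\mathcal{O}(dn \log n)$ training cost quoted above is what one would expect if the triangular-masked Toeplitz product is evaluated as a causal convolution with the FFT. The following is a minimal NumPy sketch of that general technique, not the authors' implementation. The function names, the single weight vector `w` shared across all `d` channels, and the sizes `n = 16`, `d = 4` are assumptions made for illustration only.

```python
import numpy as np

def causal_toeplitz_dense(w, x):
    """Reference O(n^2 d) product: builds T with T[i, j] = w[i - j] for i >= j, else 0."""
    n = x.shape[0]
    idx = np.arange(n)
    lag = idx[:, None] - idx[None, :]            # lag[i, j] = i - j
    T = np.where(lag >= 0, w[np.clip(lag, 0, n - 1)], 0.0)
    return T @ x

def causal_toeplitz_fft(w, x):
    """Same product in O(d n log n) time via zero-padded FFT convolution."""
    n = x.shape[0]
    m = 2 * n                                    # padding to >= 2n - 1 avoids circular wrap-around
    W = np.fft.rfft(w, n=m)                      # spectrum of the first column of T
    X = np.fft.rfft(x, n=m, axis=0)              # spectrum of each channel along the sequence axis
    y = np.fft.irfft(W[:, None] * X, n=m, axis=0)
    return y[:n]                                 # keep only the causal part

# Hypothetical sizes: sequence length n, hidden width d.
rng = np.random.default_rng(0)
n, d = 16, 4
w = rng.standard_normal(n)                       # assumed: one Toeplitz weight per lag
x = rng.standard_normal((n, d))

y_dense = causal_toeplitz_dense(w, x)
y_fft = causal_toeplitz_fft(w, x)
max_err = float(np.max(np.abs(y_dense - y_fft)))

# Causality check: perturbing the last token must leave earlier outputs unchanged.
x_perturbed = x.copy()
x_perturbed[-1] += 1.0
y_perturbed = causal_toeplitz_fft(w, x_perturbed)
prefix_change = float(np.max(np.abs(y_perturbed[:-1] - y_fft[:-1])))
```

The dense version materializes the $n \times n$ lower-triangular Toeplitz matrix and is included only to check the FFT path. The FFT version stores just the $n$ weights and the padded activations, which is consistent with the $\mathcal{O}(dn)$ space figure.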

