LG | Jul 19, 2024

Longhorn: State Space Models are Amortized Online Learners

Apple
arXiv:2407.14207v5 | 41 citations | h-index: 19
Originality: Highly original
AI Analysis

This work is aimed at researchers and practitioners concerned with the efficiency and scalability of large language models. It offers a competitive alternative to Transformers with linear decoding complexity.

The paper tackles the quadratic decoding complexity of Transformers in sequence modeling by introducing Longhorn, a deep state-space model architecture derived from online learning principles. Longhorn achieves a 1.8x improvement in sample efficiency over Mamba and extrapolates to contexts up to 16x longer at inference.

Modern large language models are built on sequence modeling via next-token prediction. While the Transformer remains the dominant architecture for sequence modeling, its quadratic decoding complexity in sequence length poses a major limitation. State-space models (SSMs) present a competitive alternative, offering linear decoding efficiency while maintaining parallelism during training. However, most existing SSMs rely on linear recurrence designs that appear somewhat ad hoc. In this work, we explore SSM design through the lens of online learning, conceptualizing SSMs as meta-modules for specific online learning problems. This approach links SSM design to formulating precise online learning objectives, with state transition rules derived from solving these objectives. Based on this insight, we introduce a novel deep SSM architecture, Longhorn, whose update resembles the closed-form solution for solving the online associative recall problem. Our experimental results show that Longhorn outperforms state-of-the-art SSMs, including the Mamba model, on standard sequence modeling benchmarks, language modeling, and vision tasks. Specifically, Longhorn achieves a 1.8x improvement in sample efficiency compared to Mamba, and can extrapolate over contexts that are up to 16x longer during inference.
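To make the "state transition as the solution of an online learning objective" idea concrete, here is a minimal sketch in plain Python. It rests on assumptions the abstract does not spell out. The objective is taken to be min over S of ||S - S_prev||_F^2 + beta * ||S k - x||^2, where k is a key, x is the value to recall, and beta is a step-size-like weight. The names `recall_update` and `objective_grad` are hypothetical helpers for illustration. The code computes the exact rank-one minimizer of that objective. The abstract only says Longhorn's update resembles such a closed form, so this should not be read as Longhorn's actual recurrence.

```python
import random


def matvec(S, k):
    """Return S k for a d x m matrix S (list of rows) and a length-m vector k."""
    return [sum(row[j] * k[j] for j in range(len(k))) for row in S]


def recall_update(S_prev, k, x, beta):
    """Closed-form minimizer of the ASSUMED objective
        ||S - S_prev||_F^2 + beta * ||S k - x||^2.
    Setting the gradient to zero and applying Sherman-Morrison gives
        S = S_prev + eps * (x - S_prev k) k^T,  eps = beta / (1 + beta * k^T k).
    """
    eps = beta / (1.0 + beta * sum(kj * kj for kj in k))
    pred = matvec(S_prev, k)  # what the old state recalls for key k
    return [
        [row[j] + eps * (x[i] - pred[i]) * k[j] for j in range(len(k))]
        for i, row in enumerate(S_prev)
    ]


def objective_grad(S, S_prev, k, x, beta):
    """Gradient of the assumed objective at S; near zero at the minimizer."""
    resid = [p - xi for p, xi in zip(matvec(S, k), x)]
    return [
        [2.0 * (S[i][j] - S_prev[i][j]) + 2.0 * beta * resid[i] * k[j]
         for j in range(len(k))]
        for i in range(len(S))
    ]


if __name__ == "__main__":
    random.seed(0)
    d, m = 3, 4  # value and key dimensions (arbitrary illustrative sizes)
    S = [[0.0] * m for _ in range(d)]
    # Stream a few (key, value) pairs; the state keeps a fixed d x m size.
    for _ in range(5):
        k = [random.gauss(0.0, 1.0) for _ in range(m)]
        x = [random.gauss(0.0, 1.0) for _ in range(d)]
        S_next = recall_update(S, k, x, beta=0.7)
        grad = objective_grad(S_next, S, k, x, beta=0.7)
        print(max(abs(g) for row in grad for g in row))  # stationarity check
        S = S_next
```

Each step touches only the fixed-size state, so the per-token cost in this sketch does not grow with the number of tokens already processed. That is the sense in which a recurrent update of this kind decodes in linear time overall. As beta grows, the updated state recalls x for key k almost exactly; a small beta keeps the state close to its previous value.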

Code Implementations: 1 repo
