CL · LG · May 9

Structured Recurrent Mixers for Massively Parallelized Sequence Generation

arXiv:2605.08696 · 5.0

Predicted impact top 81% in CL · last 90 days

Originality Incremental advance

AI Analysis

For practitioners needing efficient long-sequence generation, SRMs offer a practical alternative to Transformers with higher throughput and concurrency, though the novelty is incremental as it combines known ideas.

The paper introduces Structured Recurrent Mixers (SRMs), an architecture that converts between sequence-parallel training and recurrent inference, achieving 12x throughput and 170x concurrency over Transformers on vLLM, with a 30% increase in GSM8k Pass@k under constant compute.

Over the last two decades, language modeling has shifted from predominantly recurrent architectures, which process tokens sequentially during both training and inference, to non-recurrent models that process sequence elements in parallel during training. This shift yields greater training efficiency and stability at the expense of lower inference throughput. Here we introduce the Structured Recurrent Mixer (SRM), an architecture that allows for algebraic conversion between a sequence-parallel representation at train time and a recurrent representation at inference, notably without the need for specialized kernels or device-specific memory management. We show experimentally that this dual representation allows for greater training efficiency, higher input information capacity, and larger inference throughput and concurrency when compared to other linear-complexity models. We postulate that recurrent models are poorly suited to extended sequence length scaling for information-rich inputs typical of language, but are well suited to scaling in the sample (batch) dimension due to their constant memory per sample. We provide Mojo/MAX inference implementations of SRMs exhibiting 12x the throughput and 170x the concurrency of similarly powerful Transformers run on vLLM, with increases characteristic of PyTorch implementations resulting in a 30% increase in compute-constant GSM8k Pass@k. We conclude by demonstrating that SRMs are effective reinforcement learning training candidates.
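The dual representation described in the abstract can be illustrated with a toy linear mixer. This is a minimal sketch and not the SRM parameterization, which the abstract does not specify. The scalar `decay` weight and the function names `parallel_mix` and `recurrent_mix` are assumptions chosen for illustration. The sequence-parallel form applies a lower-triangular (causal) mixing matrix to the whole sequence at once, while the recurrent form carries a single state value forward, so its memory per sample stays constant as the sequence grows.

```python
def parallel_mix(xs, decay):
    """Sequence-parallel form: multiply by a causal mixing matrix.

    mix[t][s] = decay ** (t - s) for s <= t, and 0 otherwise (hypothetical toy structure).
    """
    n = len(xs)
    mix = [[decay ** (t - s) if s <= t else 0.0 for s in range(n)] for t in range(n)]
    return [sum(mix[t][s] * xs[s] for s in range(n)) for t in range(n)]


def recurrent_mix(xs, decay):
    """Recurrent form: one scalar state, updated once per token."""
    state = 0.0
    outputs = []
    for x in xs:
        state = decay * state + x  # h_t = decay * h_{t-1} + x_t
        outputs.append(state)
    return outputs


xs = [1.0, 2.0, -0.5, 3.0]
par = parallel_mix(xs, 0.5)
rec = recurrent_mix(xs, 0.5)
# Both forms compute the same outputs; only the evaluation order differs.
assert all(abs(p - r) < 1e-12 for p, r in zip(par, rec))
```

The parallel form needs the full sequence and an n-by-n matrix, which suits training on whole sequences. The recurrent form needs only the previous state, which is why many samples can be decoded concurrently at a fixed memory cost per sample.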
