AR · Apr 4

Mambalaya: Einsum-Based Fusion Optimizations on State-Space Models

arXiv:2604.03829
AI Analysis

This work addresses the memory-bound performance of Mamba models on modern hardware, offering a principled fusion approach for complex operator cascades.

Mambalaya proposes a reconfigurable accelerator for Mamba state-space models that uses Einsum-based fusion to reduce inter-operator memory traffic, achieving 4.9x speedup for prefill and 1.9x for generation over MARCA, and up to 1.5x over a recent fusion accelerator.

Mamba is an emerging, complex workload with various short-range and long-range dependencies, nonlinearities, and elementwise computations that are unable to run at near-peak speeds on modern hardware. Specifically, Mamba's complex dependency graph makes fusion across its full operator cascade difficult, leaving substantial inter-operator memory traffic on the table. To address these challenges, we propose Mambalaya, a novel reconfigurable accelerator that leverages fusion to overcome the limitations of Mamba. We use the recently proposed cascade-of-Einsums abstraction to characterize Mamba's full computational structure, then apply the extended Einsum framework to systematically explore inter-Einsum fusion opportunities. This principled approach yields a series of fusion mappings that reduce off-chip inter-Einsum traffic. These mappings are supported by the underlying Mambalaya architecture. Mambalaya achieves a layer performance speedup of 4.9x for prefill and 1.9x for generation over MARCA. In prefill-dominated scenarios, it achieves up to 1.5x over a recent fine-grained, memory-aware fusion accelerator for Mamba.
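
To make the idea of inter-Einsum fusion concrete, here is a minimal sketch in plain Python. It is not the paper's mapping or architecture. It assumes the commonly used simplified Mamba selective-scan recurrence (h[l] = exp(delta*A) * h[l-1] + delta*B*x, y = C*h), which the abstract itself does not spell out, and the function names and the "intermediate elements" counter are hypothetical, serving only as a crude proxy for inter-Einsum traffic. The unfused version runs each Einsum to completion and materializes its full output; the fused version consumes each intermediate value as soon as it is produced and keeps only the recurrent state.

```python
import math

def selective_scan_unfused(x, delta, A, B, C):
    """Run each Einsum to completion, materializing every intermediate tensor."""
    L, D, N = len(x), len(A), len(A[0])
    # Einsum 1: A_bar[l,d,n] = exp(delta[l,d] * A[d,n])
    A_bar = [[[math.exp(delta[l][d] * A[d][n]) for n in range(N)]
              for d in range(D)] for l in range(L)]
    # Einsum 2: dBx[l,d,n] = delta[l,d] * B[l,n] * x[l,d]
    dBx = [[[delta[l][d] * B[l][n] * x[l][d] for n in range(N)]
            for d in range(D)] for l in range(L)]
    # Einsum 3 (recurrence over l): h[l,d,n] = A_bar[l,d,n] * h[l-1,d,n] + dBx[l,d,n]
    h, prev = [], [[0.0] * N for _ in range(D)]
    for l in range(L):
        cur = [[A_bar[l][d][n] * prev[d][n] + dBx[l][d][n] for n in range(N)]
               for d in range(D)]
        h.append(cur)
        prev = cur
    # Einsum 4: y[l,d] = sum over n of C[l,n] * h[l,d,n]
    y = [[sum(C[l][n] * h[l][d][n] for n in range(N)) for d in range(D)]
         for l in range(L)]
    intermediate_elems = 3 * L * D * N  # A_bar, dBx, and h are all stored in full
    return y, intermediate_elems

def selective_scan_fused(x, delta, A, B, C):
    """Fuse the four Einsums: only the D x N recurrent state is kept live."""
    L, D, N = len(x), len(A), len(A[0])
    h = [[0.0] * N for _ in range(D)]
    y = []
    for l in range(L):
        row = []
        for d in range(D):
            acc = 0.0
            for n in range(N):
                a_bar = math.exp(delta[l][d] * A[d][n])  # produced, used immediately
                h[d][n] = a_bar * h[d][n] + delta[l][d] * B[l][n] * x[l][d]
                acc += C[l][n] * h[d][n]
            row.append(acc)
        y.append(row)
    intermediate_elems = D * N  # just the running state
    return y, intermediate_elems
```

Both functions return the same outputs up to floating-point rounding. For sequence length L, channel count D, and state size N, the unfused version stores 3*L*D*N intermediate elements while the fused one stores D*N, which is the kind of reduction a fusion mapping targets. Real accelerators must also handle tiling, buffer capacity, and the remaining operators in a Mamba layer, none of which this sketch models.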
