Mamba Drafters for Speculative Decoding
For practitioners accelerating LLM generation, this work targets the trade-off between drafter speed and drafter flexibility, offering an incremental improvement over existing speculative decoding approaches.
The paper tackles the speed-versus-flexibility trade-off in speculative decoding for LLMs by introducing Mamba-based drafters. Because state space models avoid the quadratic complexity of Transformers, these drafters draft faster and use less memory, perform comparably to state-of-the-art self-speculation methods, and remain adaptable across target models.
Speculative decoding has emerged as a promising approach to accelerating large language model (LLM) generation using a fast drafter while maintaining alignment with the target model's distribution. However, existing approaches face a trade-off: external drafters offer flexibility but can suffer from slower drafting, while self-speculation methods use drafters tailored to the target model but require re-training. In this paper, we introduce novel drafters based on Mamba, a state-of-the-art state space model (SSM), as a solution that combines the best aspects of both approaches. By leveraging the linear structure of SSMs, our approach avoids the quadratic complexity inherent in traditional Transformer-based methods, enabling faster drafting and lower memory usage while maintaining the flexibility to work across different target models. We further enhance efficiency with a novel test-time tree search algorithm for generating high-quality draft candidates. Our empirical evaluation demonstrates that Mamba-based drafters not only outperform existing external drafting methods but are also comparable to state-of-the-art self-speculation approaches while using less memory and maintaining their cross-model adaptability.
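To make "maintaining alignment with the target model's distribution" concrete, the sketch below shows the standard speculative sampling verification rule that speculative decoding methods typically rely on. It is not the paper's Mamba drafter or its tree search algorithm, whose details the abstract does not give. The function name `verify_draft` and the toy list-of-probabilities inputs are assumptions made for illustration only.

```python
import random


def verify_draft(draft_tokens, q_probs, p_probs, rng):
    """Verify one block of drafted tokens against the target model.

    Hypothetical helper for illustration (not from the paper).
    draft_tokens: token ids proposed by the drafter, one per position.
    q_probs[i]:   drafter distribution (list of floats) at position i.
    p_probs[i]:   target distribution at position i; it holds one extra
                  entry, used for the bonus token when all drafts pass.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept the drafted token with probability min(1, p/q).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # On rejection, resample from the residual max(p - q, 0);
        # random.choices normalizes the weights internally.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # Every draft was accepted: sample one extra token from the target.
    last = p_probs[len(draft_tokens)]
    out.append(rng.choices(range(len(last)), weights=last)[0])
    return out
```

Under this rule the emitted tokens follow the target distribution regardless of how good the drafter is; a better drafter only raises the number of tokens accepted per target-model call, which is where the speedup comes from.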