VideoMamba: Spatio-Temporal Selective State Space Model
This work addresses the high computational cost of video understanding and offers researchers and practitioners an efficient baseline, though the contribution is incremental because it adapts an existing architecture.
The paper tackles video recognition by introducing VideoMamba, a model built on Mamba's linear complexity and selective state space mechanism, and reports competitive performance and outstanding efficiency on various benchmarks.
We introduce VideoMamba, a novel adaptation of the pure Mamba architecture, specifically designed for video recognition. Unlike transformers, whose self-attention mechanism incurs high computational cost because of its quadratic complexity, VideoMamba leverages Mamba's linear complexity and selective SSM mechanism for more efficient processing. The proposed Spatio-Temporal Forward and Backward SSM allows the model to effectively capture the complex relationship between non-sequential spatial and sequential temporal information in video. Consequently, VideoMamba is not only resource-efficient but also effective at capturing long-range dependencies in videos, as demonstrated by competitive performance and outstanding efficiency on a variety of video understanding benchmarks. Our work highlights the potential of VideoMamba as a powerful tool for video understanding, offering a simple yet effective baseline for future research in video analysis.
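To make the idea of a forward and backward selective scan concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a video is flattened frame by frame into a single token sequence, that each direction has its own parameters, and that the two directions are combined by summation. The helper names (`init_params`, `selective_scan`, `bidirectional_video_scan`), the random weights, and the simplified discretization are all hypothetical choices made for illustration.

```python
import numpy as np


def init_params(dim, state, seed):
    """Random illustrative parameters (hypothetical, not trained weights)."""
    rng = np.random.default_rng(seed)
    return {
        "W_delta": 0.1 * rng.standard_normal((dim, dim)),
        "W_B": 0.1 * rng.standard_normal((dim, state)),
        "W_C": 0.1 * rng.standard_normal((dim, state)),
        "A": -0.1 * np.ones((dim, state)),  # negative, so the state decays
    }


def selective_scan(x, p):
    """Single-direction selective scan over x of shape (L, D); one step per token."""
    length = x.shape[0]
    h = np.zeros_like(p["A"])  # hidden state, shape (D, N)
    y = np.empty_like(x)
    for t in range(length):
        # Step size, B, and C depend on the current token: the "selective" part.
        delta = np.logaddexp(0.0, x[t] @ p["W_delta"])  # softplus, shape (D,)
        B = x[t] @ p["W_B"]                             # shape (N,)
        C = x[t] @ p["W_C"]                             # shape (N,)
        A_bar = np.exp(delta[:, None] * p["A"])         # discretized decay, (D, N)
        h = A_bar * h + (delta[:, None] * B[None, :]) * x[t][:, None]
        y[t] = h @ C                                    # readout, shape (D,)
    return y


def bidirectional_video_scan(video, p_fwd, p_bwd):
    """Flatten (T, H, W, D) to one token sequence and scan it in both directions."""
    T, H, W, D = video.shape
    tokens = video.reshape(T * H * W, D)             # frame-major order (assumed)
    fwd = selective_scan(tokens, p_fwd)              # each token sees earlier tokens
    bwd = selective_scan(tokens[::-1], p_bwd)[::-1]  # each token sees later tokens
    return (fwd + bwd).reshape(T, H, W, D)           # summation is an assumption
```

Each scan visits every token once, so the cost grows linearly with the number of tokens `T * H * W`, in contrast to the pairwise comparisons of self-attention. Running the scan in both directions lets every spatial location draw on context from both earlier and later positions in the flattened sequence, which a single causal pass cannot provide.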