TextMamba: Scene Text Detector with Mamba
This work improves scene text detection for applications such as document analysis and image understanding, but the contribution is incremental because it builds on existing Mamba and Transformer approaches.
The paper addresses limitations of Transformer-based scene text detectors, which can forget important information or focus on irrelevant representations when modeling long-range dependencies. It proposes a detector based on Mamba that integrates a selection mechanism with attention layers, achieving state-of-the-art or competitive performance with F-measures of 89.7%, 89.2%, and 78.5% on the CTW1500, TotalText, and ICDAR19ArT benchmarks, respectively.
In scene text detection, Transformer-based methods have addressed the global feature extraction limitations inherent in traditional convolutional neural network-based methods. However, most of them rely directly on native Transformer attention layers as encoders without evaluating their cross-domain limitations and inherent shortcomings: forgetting important information or focusing on irrelevant representations when modeling long-range dependencies for text detection. The recently proposed state space model Mamba has demonstrated better long-range dependency modeling through a linear-complexity selection mechanism. Therefore, we propose a novel scene text detector based on Mamba that integrates the selection mechanism with attention layers, enhancing the encoder's ability to extract relevant information from long sequences. We adopt the Top_k algorithm to explicitly select key information and reduce the interference of irrelevant information in Mamba modeling. Additionally, we design a dual-scale feed-forward network and an embedding pyramid enhancement module to facilitate high-dimensional hidden state interactions and multi-scale feature fusion. Our method achieves state-of-the-art or competitive performance on various benchmarks, with F-measures of 89.7%, 89.2%, and 78.5% on CTW1500, TotalText, and ICDAR19ArT, respectively. Code will be made available.
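The Top_k selection step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how relevance scores are computed or where the selection sits in the encoder, so the function name `top_k_select`, the per-token `scores`, and the example values below are all hypothetical. This is a generic top-k token filter written under those assumptions, not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def top_k_select(tokens: Sequence[Sequence[float]],
                 scores: Sequence[float],
                 k: int) -> Tuple[List[int], List[Sequence[float]]]:
    """Keep the k highest-scoring tokens, preserving their original order.

    Hypothetical helper: `scores` stands in for whatever relevance measure
    a real model would produce for each token.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    # Clamp k so that oversized or negative values are handled gracefully.
    k = max(0, min(k, len(tokens)))
    # Rank indices by score, highest first; ties go to the earlier token.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Restore scan order for the kept indices, since a sequential model
    # (such as a state space model) consumes tokens in order.
    kept = sorted(ranked[:k])
    return kept, [tokens[i] for i in kept]


if __name__ == "__main__":
    # Made-up 2-D token embeddings and relevance scores for illustration.
    tokens = [[0.1, 0.2], [0.9, 0.8], [0.0, 0.1], [0.7, 0.6], [0.2, 0.1]]
    scores = [0.05, 0.90, 0.01, 0.75, 0.10]
    kept, selected = top_k_select(tokens, scores, k=2)
    print(kept)      # [1, 3]
    print(selected)  # [[0.9, 0.8], [0.7, 0.6]]
```

Filtering before the sequence model shortens the input and removes low-scoring tokens that would otherwise contribute to the hidden state, which is the intuition behind reducing interference from irrelevant information.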