CV · Sep 13, 2024

Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection

arXiv:2409.08513v3 · 9 citations · h-index: 34
Originality: Incremental advance
AI Analysis

This work addresses efficiency and accuracy issues in open-vocabulary detection for real-time applications, representing an incremental improvement over existing YOLO-based methods.

The paper tackles the performance limitations of YOLO-World in open-vocabulary detection by introducing Mamba-YOLO-World, whose MambaFusion-PAN neck fuses features with linear complexity and globally guided receptive fields. The model outperforms YOLO-World on the COCO and LVIS benchmarks at comparable parameter and FLOP counts, and it surpasses existing state-of-the-art OVD methods with fewer parameters and FLOPs.

Open-vocabulary detection (OVD) aims to detect objects beyond a predefined set of categories. As a pioneering model incorporating the YOLO series into OVD, YOLO-World is well-suited for scenarios prioritizing speed and efficiency. However, its performance is hindered by its neck feature fusion mechanism, which incurs quadratic complexity and limits the guided receptive fields. To address these limitations, we present Mamba-YOLO-World, a novel YOLO-based OVD model employing the proposed MambaFusion Path Aggregation Network (MambaFusion-PAN) as its neck architecture. Specifically, we introduce an innovative State Space Model-based feature fusion mechanism consisting of a Parallel-Guided Selective Scan algorithm and a Serial-Guided Selective Scan algorithm with linear complexity and globally guided receptive fields. It leverages multi-modal input sequences and Mamba hidden states to guide the selective scanning process. Experiments demonstrate that our model outperforms the original YOLO-World on the COCO and LVIS benchmarks in both zero-shot and fine-tuning settings while maintaining comparable parameters and FLOPs. Additionally, it surpasses existing state-of-the-art OVD methods with fewer parameters and FLOPs.
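
To make the "linear complexity" claim concrete, here is a minimal toy sketch of a guided selective scan in plain Python. It is not the paper's Parallel-Guided or Serial-Guided Selective Scan: the scalar hidden state, the sigmoid gate, and the names `selective_scan`, `guide`, and `pairwise_interactions` are all assumptions made for illustration. It only shows the general idea that a recurrence with input-dependent gates visits each sequence element once, whereas an all-to-all fusion touches every pair of elements.

```python
import math


def selective_scan(xs, guide):
    """Toy 1-D selective scan: a single pass, so work grows linearly with len(xs).

    xs    -- list of floats (hypothetical stand-in for flattened image features)
    guide -- list of floats of the same length (hypothetical stand-in for a
             text-derived guidance signal; not the paper's formulation)
    """
    h = 0.0    # hidden state carried along the sequence
    ys = []
    steps = 0  # number of recurrence updates performed
    for x, g in zip(xs, guide):
        # Input-dependent ("selective") gate in (0, 1), here a sigmoid of the guide.
        a = 1.0 / (1.0 + math.exp(-g))
        # Recurrence: keep a fraction of the old state and mix in the new input.
        h = a * h + (1.0 - a) * x
        ys.append(h)
        steps += 1
    return ys, steps


def pairwise_interactions(n):
    """Number of element pairs an all-to-all fusion over n elements would touch."""
    return n * n


xs = [1.0, 2.0, 3.0, 4.0]
guide = [0.0, 0.0, 0.0, 0.0]  # sigmoid(0) = 0.5 at every step
ys, steps = selective_scan(xs, guide)
print(ys)                                      # [0.5, 1.25, 2.125, 3.0625]
print(steps, pairwise_interactions(len(xs)))   # 4 16
```

Doubling the sequence length doubles `steps` but quadruples `pairwise_interactions`, which is the scaling gap the abstract refers to.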

Code Implementations: 1 repo
