CV | Jan 30, 2024

YOLO-World: Real-Time Open-Vocabulary Object Detection

Tencent
arXiv:2401.17270v3 | 866 citations | h-index: 72 | CVPR
Originality: Highly original
AI Analysis

YOLO-World enables real-time detection of diverse objects in open scenarios, addressing a key bottleneck for practical applications such as robotics and surveillance.

The paper addresses the restriction of YOLO detectors to predefined categories by introducing YOLO-World, which adds open-vocabulary detection and achieves 35.4 AP at 52.0 FPS on the LVIS dataset, measured on a V100 GPU.

The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation.
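
To make the idea of a region-text contrastive loss concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not give the exact formulation, so the sketch assumes a generic version in which region and text embeddings are L2-normalized, compared by a scaled dot product, and scored with softmax cross-entropy against the index of each region's assigned text. The function names, the `scale` and `shift` parameters, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def region_text_similarity(region_embs, text_embs, scale=1.0, shift=0.0):
    """Scaled cosine similarity between every region and every text embedding.

    `scale` and `shift` are illustrative knobs (assumptions), not paper values.
    """
    regions = [l2_normalize(r) for r in region_embs]
    texts = [l2_normalize(t) for t in text_embs]
    return [
        [scale * sum(a * b for a, b in zip(r, t)) + shift for t in texts]
        for r in regions
    ]


def region_text_contrastive_loss(similarity, assigned_text):
    """Mean cross-entropy of each similarity row against its assigned text index."""
    total = 0.0
    for row, target in zip(similarity, assigned_text):
        peak = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in row))
        total += log_sum_exp - row[target]
    return total / len(similarity)


# Toy example: two regions, a vocabulary of two prompts (e.g. "dog", "bicycle").
region_embs = [[1.0, 0.1], [0.0, 2.0]]
text_embs = [[1.0, 0.0], [0.0, 1.0]]

sim = region_text_similarity(region_embs, text_embs, scale=10.0)
# Open-vocabulary classification: pick the best-matching prompt per region.
predicted = [max(range(len(row)), key=row.__getitem__) for row in sim]
loss_matched = region_text_contrastive_loss(sim, assigned_text=[0, 1])
loss_swapped = region_text_contrastive_loss(sim, assigned_text=[1, 0])
```

In this sketch the vocabulary is just the list of text embeddings, so changing what the detector looks for means swapping that list rather than retraining a fixed classification head. The loss is lower when regions are assigned to the prompts they actually resemble than when the assignments are swapped.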

Code Implementations: 3 repos
