CV | May 27, 2025

Open-Det: An Efficient Learning Framework for Open-Ended Detection

arXiv:2505.20639v1 | 4 citations | h-index: 9 | Has Code | ICML
Originality: Incremental advance
AI Analysis

This work addresses the challenge of inefficient training and limited performance in open-ended object detection for computer vision applications, representing an incremental improvement over existing methods.

The paper tackles Open-Ended object Detection (OED), which detects objects and generates their category names without a predefined vocabulary, by proposing the Open-Det framework to improve training efficiency and performance. Compared to the baseline GenerateU, it achieves higher accuracy (+1.0% in APr) while using significantly less training data (0.077M vs. 5.077M), fewer epochs (31 vs. 149), and fewer GPU resources (4 V100 vs. 16 A100).
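The relative savings implied by these figures can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the numbers quoted above; the dictionary keys and the helper `percent_of` are illustrative names, not anything taken from the paper or its repository.

```python
# Figures as quoted in the summary (Open-Det vs. the GenerateU baseline).
# Key names are illustrative; "train_data_m" is training data in millions.
open_det = {"train_data_m": 0.077, "epochs": 31}
generate_u = {"train_data_m": 5.077, "epochs": 149}


def percent_of(part: float, whole: float) -> float:
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


data_pct = percent_of(open_det["train_data_m"], generate_u["train_data_m"])
epoch_pct = percent_of(open_det["epochs"], generate_u["epochs"])

print(f"training data:   {data_pct:.1f}% of the baseline")
print(f"training epochs: {epoch_pct:.1f}% of the baseline")
```

Rounded to one decimal place, the ratios come out to 1.5% of the training data and 20.8% of the training epochs.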

Open-Ended object Detection (OED) is a novel and challenging task that detects objects and generates their category names in a free-form manner, without requiring additional vocabularies during inference. However, existing OED models, such as GenerateU, require large-scale datasets for training, suffer from slow convergence, and exhibit limited performance. To address these issues, we present a novel and efficient Open-Det framework, consisting of four collaborative parts. Specifically, Open-Det accelerates model training in both the bounding box and object name generation processes by reconstructing the Object Detector and the Object Name Generator. To bridge the semantic gap between the Vision and Language modalities, we propose a Vision-Language Aligner with V-to-L and L-to-V alignment mechanisms, combined with a Prompts Distiller that transfers knowledge from the VLM into VL-prompts, enabling accurate object name generation by the LLM. In addition, we design a Masked Alignment Loss to eliminate contradictory supervision and introduce a Joint Loss to enhance classification, resulting in more efficient training. Compared to GenerateU, Open-Det achieves higher performance (+1.0% in APr) while using only 1.5% of the training data (0.077M vs. 5.077M), 20.8% of the training epochs (31 vs. 149), and fewer GPU resources (4 V100 vs. 16 A100). The source code is available at: https://github.com/Med-Process/Open-Det.
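The abstract does not spell out how the Masked Alignment Loss is defined, so the following is only a generic illustration of the idea of removing contradictory supervision from a bidirectional (V-to-L and L-to-V) alignment objective. It assumes a symmetric contrastive loss in which off-diagonal pairs that share a category label are masked out rather than treated as negatives. The function name, the temperature value, and the masking rule are all assumptions made for this sketch and should not be read as the authors' formulation.

```python
import numpy as np


def masked_alignment_loss(vis, txt, labels, temperature=0.07):
    """Hypothetical symmetric contrastive loss with same-label negatives masked.

    vis, txt: (N, D) arrays of paired visual and language embeddings.
    labels:   (N,) integer category ids for each pair.
    """
    # L2-normalise so the dot product is a cosine similarity.
    vis = vis / np.linalg.norm(vis, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = vis @ txt.T / temperature  # (N, N) similarity matrix

    # Off-diagonal pairs with the same label would be pushed apart as
    # negatives even though they describe the same category; mask them.
    same_label = labels[:, None] == labels[None, :]
    off_diagonal = ~np.eye(len(labels), dtype=bool)
    masked = np.where(same_label & off_diagonal, -np.inf, logits)

    def diagonal_cross_entropy(scores):
        # Row-wise log-softmax, then negative log-likelihood of the diagonal.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the V-to-L direction (rows) and the L-to-V direction (columns).
    return 0.5 * (diagonal_cross_entropy(masked) + diagonal_cross_entropy(masked.T))


# Two pairs share label 0; without the mask they would act as mutual negatives.
emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
loss = masked_alignment_loss(emb, emb, np.array([0, 0, 1]))
print(loss)
```

In the toy example, the first two pairs have identical embeddings and the same label. An unmasked contrastive loss would penalise their similarity, whereas the masked version ignores that pair and the loss stays close to zero.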

Code Implementations: 1 repo