MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities
This work addresses the inability of closed-set detectors to recognize novel objects in medical imaging, a limitation that affects both clinicians and researchers. The contribution is incremental, however, because it adapts existing open-vocabulary methods to a new domain.
The paper tackles the limitations of closed-set object detection in medical imaging by introducing MedROV, a real-time open-vocabulary detection model. MedROV outperforms the previous state-of-the-art foundation model for medical image detection by an average of 40 mAP50 (absolute) and runs at 70 FPS.
Traditional object detection models in medical imaging operate within a closed-set paradigm, limiting their ability to detect objects with novel labels. Open-vocabulary object detection (OVOD) addresses this limitation but remains underexplored in medical imaging due to dataset scarcity and weak text-image alignment. To bridge this gap, we introduce MedROV, the first Real-time Open Vocabulary detection model for medical imaging. To enable open-vocabulary learning, we curate a large-scale dataset, Omnis, with 600K detection samples across nine imaging modalities and introduce a pseudo-labeling strategy to handle missing annotations from multi-source datasets. Additionally, we enhance generalization by incorporating knowledge from a large pre-trained foundation model. By leveraging contrastive learning and cross-modal representations, MedROV effectively detects both known and novel structures. Experimental results demonstrate that MedROV outperforms the previous state-of-the-art foundation model for medical image detection with an average absolute improvement of 40 mAP50, and surpasses closed-set detectors by more than 3 mAP50, while running at 70 FPS, setting a new benchmark in medical detection. Our source code, dataset, and trained model are available at https://github.com/toobatehreem/MedROV.
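
To illustrate the general idea behind open-vocabulary classification with contrastively aligned embeddings, the following is a minimal sketch and not MedROV's actual implementation. It assumes each detected region and each text label has already been mapped to a vector in a shared embedding space. The function names (`classify_region`, `cosine_similarity`), the toy three-dimensional embeddings, and the 0.5 similarity threshold are all hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def classify_region(region_embedding, label_embeddings, threshold=0.5):
    """Assign the text label whose embedding is most similar to the region.

    label_embeddings maps label text to an embedding vector. Because labels
    are supplied at inference time, new labels can be added without
    retraining. Returns (None, scores) when no label reaches the
    (hypothetical) threshold.
    """
    scores = {
        label: cosine_similarity(region_embedding, emb)
        for label, emb in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    if scores[best_label] < threshold:
        return None, scores
    return best_label, scores


# Toy embeddings (assumed, not produced by any real encoder).
label_embeddings = {
    "tumor": [1.0, 0.0, 0.0],
    "kidney": [0.0, 1.0, 0.0],
    "polyp": [0.0, 0.0, 1.0],
}
region = [0.1, 0.9, 0.1]
label, scores = classify_region(region, label_embeddings)
print(label)
```

In this toy setup the label set is just a dictionary passed in at inference time, which is what distinguishes the open-vocabulary setting from a closed-set detector with a fixed classification head. A real system would obtain the vectors from trained image and text encoders rather than hand-written lists.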