A Text-Guided Vision Model for Enhanced Recognition of Small Instances
This work is an incremental improvement for drone-based object detection systems, enabling more precise identification of user-specified targets through text prompts.
The paper addresses small object detection in drone-based applications with an enhanced text-guided vision model. On the VisDrone dataset, the model achieves modest gains in precision (40.6% to 41.6%), recall (30.8% to 31%), F1 score (35% to 35.5%), and mAP@0.5 (30.4% to 30.7%), while reducing parameters from 4M to 3.8M and FLOPs from 15.7B to 15.2B.
As drone-based object detection technology continues to evolve, demand is shifting from merely detecting objects to letting users accurately identify specific targets, for example by entering the desired targets as text prompts. To address this need, an efficient text-guided object detection model has been developed to enhance the detection of small objects. Specifically, an improved version of the existing YOLO-World model is introduced. The proposed method replaces the C2f layer in the YOLOv8 backbone with a C3k2 layer, enabling more precise representation of local features, particularly for small objects or those with clearly defined boundaries. The proposed architecture also improves processing speed and efficiency through parallel processing optimization and contributes to a more lightweight model design. Comparative experiments on the VisDrone dataset show that the proposed model outperforms the original YOLO-World model, with precision increasing from 40.6% to 41.6%, recall from 30.8% to 31%, F1 score from 35% to 35.5%, and mAP@0.5 from 30.4% to 30.7%. The model is also lighter, with the parameter count reduced from 4 million to 3.8 million and FLOPs reduced from 15.7 billion to 15.2 billion. These results indicate that the proposed approach provides a practical and effective solution for precise object detection in drone-based applications.
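
As a quick consistency check on the reported figures, the following minimal sketch recomputes the F1 score from the stated precision and recall using the standard harmonic-mean definition, and expresses the parameter and FLOP reductions as relative changes. The helper names `f1_score` and `relative_change` are illustrative assumptions and do not come from the paper; only the input numbers are taken from the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent (negative means a reduction)."""
    return (new - old) / old * 100


# Figures reported in the abstract (VisDrone dataset).
baseline = {"precision": 40.6, "recall": 30.8, "params_m": 4.0, "flops_b": 15.7}
proposed = {"precision": 41.6, "recall": 31.0, "params_m": 3.8, "flops_b": 15.2}

for name, m in (("YOLO-World", baseline), ("proposed", proposed)):
    # Recomputed F1 should match the reported 35% and 35.5% after rounding.
    print(f"{name}: F1 = {f1_score(m['precision'], m['recall']):.1f}%")

print(f"parameters: {relative_change(baseline['params_m'], proposed['params_m']):.1f}%")
print(f"FLOPs: {relative_change(baseline['flops_b'], proposed['flops_b']):.1f}%")
```

Under these definitions, the recomputed F1 values agree with the reported ones to one decimal place, and the size reductions correspond to roughly 5% fewer parameters and about 3% fewer FLOPs.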