CV | AI | Mar 18, 2025

LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation

Yang Zhou, Shiyu Zhao, Yuxiao Chen, Zhenting Wang, Can Jin, Dimitris N. Metaxas

arXiv:2503.13794v6 | 9 citations | h-index: 10

Originality: Highly original

AI Analysis

This work avoids the bias that hand-crafted synthetic-data pipelines introduce into open-vocabulary object detection, giving computer vision researchers and practitioners a cheaper way to bring language-model knowledge into a detector.

The paper tackles bias and prompt overfitting in open-vocabulary object detection by fusing hidden states from a large language model directly into the detector. With a Swin-T vision encoder, this improves GroundingDINO by 3.82% on OmniLabel at 8.7% extra GFLOPs; a larger vision backbone raises the improvement to 6.22%.

Large foundation models trained on large-scale vision-language data can boost Open-Vocabulary Object Detection (OVD) via synthetic training data, yet the hand-crafted pipelines often introduce bias and overfit to specific prompts. We sidestep this issue by directly fusing hidden states from Large Language Models (LLMs) into detectors, an avenue that is surprisingly under-explored. This paper presents a systematic method to enhance visual grounding by utilizing the decoder layers of the LLM inside a Multimodal LLM (MLLM). We introduce a zero-initialized cross-attention adapter to enable efficient knowledge fusion from LLMs to object detectors, a new approach called LED (LLM Enhanced Open-Vocabulary Object Detection). We find that intermediate LLM layers already encode rich spatial semantics; adapting only the early layers yields most of the gain. With Swin-T as the vision encoder, Qwen2-0.5B + LED lifts GroundingDINO by 3.82% on OmniLabel at just 8.7% extra GFLOPs, and a larger vision backbone pushes the improvement to 6.22%. Extensive ablations on adapter variants, LLM scales and fusion depths further corroborate our design.
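To make the idea of a zero-initialized cross-attention adapter concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the class name `ZeroInitCrossAttentionAdapter`, the single-head layout, the dimensions, and the choice to zero-initialize the output projection (rather than, say, a learned gate) are all assumptions made for illustration. The point it shows is only that, at initialization, the adapter leaves the detector features unchanged, so training can start from the pretrained detector's behavior.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ZeroInitCrossAttentionAdapter:
    """Hypothetical single-head adapter: detector features attend to LLM hidden states.

    Assumption: "zero-initialized" is modeled here by a zero output projection,
    so the residual branch contributes nothing before training.
    """

    def __init__(self, det_dim, llm_dim, attn_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w_q = rng.normal(0.0, 0.02, (det_dim, attn_dim))  # queries from detector features
        self.w_k = rng.normal(0.0, 0.02, (llm_dim, attn_dim))  # keys from LLM hidden states
        self.w_v = rng.normal(0.0, 0.02, (llm_dim, attn_dim))  # values from LLM hidden states
        self.w_o = np.zeros((attn_dim, det_dim))               # zero-initialized output projection

    def __call__(self, det_feats, llm_hidden):
        # det_feats: (num_tokens, det_dim); llm_hidden: (num_llm_tokens, llm_dim)
        q = det_feats @ self.w_q
        k = llm_hidden @ self.w_k
        v = llm_hidden @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
        # Residual connection: the detector features pass through untouched
        # whenever w_o is all zeros.
        return det_feats + (attn @ v) @ self.w_o


# Usage with made-up shapes: 5 detector tokens, 7 LLM tokens.
rng = np.random.default_rng(1)
det_feats = rng.normal(size=(5, 16))
llm_hidden = rng.normal(size=(7, 32))
adapter = ZeroInitCrossAttentionAdapter(det_dim=16, llm_dim=32, attn_dim=8)
fused = adapter(det_feats, llm_hidden)
assert np.allclose(fused, det_feats)  # identity at initialization
```

Once `w_o` moves away from zero during training, the LLM hidden states start to influence the detector features through the attention branch. In a real system the hidden states would come from a chosen decoder layer of the LLM, and the adapter would be trained with the detector's loss; neither is modeled here.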
