CV · AI · Mar 18, 2025

LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation

arXiv:2503.13794v6 · 9 citations · h-index: 10
Originality: Highly original
AI Analysis

This work addresses bias issues in synthetic data generation for open-vocabulary object detection, offering a more efficient and effective method for researchers and practitioners in computer vision.

The paper tackles the bias and prompt overfitting that hand-crafted synthetic-data pipelines introduce into open-vocabulary object detection. It instead fuses hidden states from a large language model directly into the detector, improving GroundingDINO on OmniLabel by 3.82% with a Swin-T vision encoder (at 8.7% extra GFLOPs) and by 6.22% with a larger vision backbone.

Large foundation models trained on large-scale vision-language data can boost Open-Vocabulary Object Detection (OVD) via synthetic training data, yet the hand-crafted pipelines often introduce bias and overfit to specific prompts. We sidestep this issue by directly fusing hidden states from Large Language Models (LLMs) into detectors, an avenue that is surprisingly under-explored. This paper presents a systematic method to enhance visual grounding by utilizing the decoder layers of the LLM inside a multimodal LLM (MLLM). We introduce a zero-initialized cross-attention adapter to enable efficient knowledge fusion from LLMs to object detectors, a new approach called LED (LLM Enhanced Open-Vocabulary Object Detection). We find that intermediate LLM layers already encode rich spatial semantics; adapting only the early layers yields most of the gain. With Swin-T as the vision encoder, Qwen2-0.5B + LED lifts GroundingDINO by 3.82% on OmniLabel at just 8.7% extra GFLOPs, and a larger vision backbone pushes the improvement to 6.22%. Extensive ablations on adapter variants, LLM scales, and fusion depths further corroborate our design.
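
The zero-initialized cross-attention adapter mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head, a residual connection, and that "zero-initialized" refers to the output projection, none of which the abstract specifies. The class name, helper functions, and dimensions are hypothetical, and plain Python lists stand in for tensors.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix, used here for weights and for toy inputs."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class ZeroInitCrossAttentionAdapter:
    """Hypothetical single-head adapter: detector features attend to LLM hidden states."""

    def __init__(self, det_dim, llm_dim, attn_dim, seed=0):
        rng = random.Random(seed)
        self.w_q = random_matrix(det_dim, attn_dim, rng)   # queries from detector features
        self.w_k = random_matrix(llm_dim, attn_dim, rng)   # keys from LLM hidden states
        self.w_v = random_matrix(llm_dim, attn_dim, rng)   # values from LLM hidden states
        # Assumption: the output projection starts at zero, so the adapter
        # contributes nothing until training moves these weights.
        self.w_o = [[0.0] * det_dim for _ in range(attn_dim)]
        self.scale = 1.0 / math.sqrt(attn_dim)

    def __call__(self, det_feats, llm_hidden):
        q = matmul(det_feats, self.w_q)                    # (n_det, attn_dim)
        k = matmul(llm_hidden, self.w_k)                   # (n_llm, attn_dim)
        v = matmul(llm_hidden, self.w_v)                   # (n_llm, attn_dim)
        scores = matmul(q, list(zip(*k)))                  # q @ k^T -> (n_det, n_llm)
        weights = [softmax([s * self.scale for s in row]) for row in scores]
        update = matmul(matmul(weights, v), self.w_o)      # (n_det, det_dim)
        # Residual connection: detector features plus the (initially zero) update.
        return [[x + u for x, u in zip(x_row, u_row)]
                for x_row, u_row in zip(det_feats, update)]


rng = random.Random(1)
det_feats = random_matrix(5, 4, rng, scale=1.0)    # 5 detector tokens, dim 4
llm_hidden = random_matrix(7, 6, rng, scale=1.0)   # 7 LLM tokens, dim 6
adapter = ZeroInitCrossAttentionAdapter(det_dim=4, llm_dim=6, attn_dim=3)
fused = adapter(det_feats, llm_hidden)
print(fused == det_feats)  # the adapter is an identity map at initialization
```

Under these assumptions, the zero-initialized projection means the detector's pretrained behavior is untouched at the start of training, and the LLM signal is blended in only as the projection weights move away from zero.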

Code Implementations: 1 repo
