CV | Jul 14, 2025

LLM-Guided Agentic Object Detection for Open-World Understanding

arXiv:2507.10844v1 | 5 citations | h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses the need for more autonomous and adaptable object detection systems in open-world environments, though it builds incrementally on existing open-world and open-vocabulary detection approaches.

The paper tackles the problem of object detection in open-world scenarios where traditional methods require fixed category sets and costly retraining for novel objects. The proposed LLM-guided agentic object detection (LAOD) framework achieves fully label-free, zero-shot detection by using an LLM to generate scene-specific object names and an open-vocabulary detector for localization, showing strong performance on LVIS, COCO, and COCO-OOD datasets.
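The two-stage flow described above (an LLM proposes object names, then an open-vocabulary detector localizes them) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `propose_names` and `detect` are hypothetical callables standing in for whatever LLM and detector are used, the comma/newline reply format is assumed, and the `min_score` threshold is an arbitrary choice.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Detection:
    label: str
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    score: float


def parse_object_names(llm_reply: str) -> List[str]:
    """Split a comma- or newline-separated reply into unique, lowercased names.

    The reply format is an assumption; a real LLM reply may need stricter parsing.
    """
    names: List[str] = []
    for raw in llm_reply.replace("\n", ",").split(","):
        name = raw.strip().lower()
        if name and name not in names:
            names.append(name)
    return names


def detect_label_free(
    image: object,
    propose_names: Callable[[object], str],                      # hypothetical LLM wrapper
    detect: Callable[[object, Sequence[str]], List[Detection]],  # hypothetical detector wrapper
    min_score: float = 0.3,                                      # assumed threshold
) -> List[Detection]:
    """Ask the LLM for scene-specific names, then localize them with the detector."""
    names = parse_object_names(propose_names(image))
    if not names:
        return []
    return [d for d in detect(image, names) if d.score >= min_score]


# Stand-in components so the sketch runs without any model.
def fake_llm(image: object) -> str:
    return "Dog, frisbee\ndog, park bench"


def fake_detector(image: object, names: Sequence[str]) -> List[Detection]:
    return [
        Detection(n, (0.0, 0.0, 10.0, 10.0), 0.1 if n == "frisbee" else 0.5)
        for n in names
    ]


result = detect_label_free(None, fake_llm, fake_detector)
print([d.label for d in result])
```

Passing the two models in as callables keeps the sketch independent of any particular LLM or detector API; no user-supplied category list appears anywhere in the call.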

Object detection traditionally relies on fixed category sets, requiring costly re-training to handle novel objects. While Open-World and Open-Vocabulary Object Detection (OWOD and OVOD) improve flexibility, OWOD lacks semantic labels for unknowns, and OVOD depends on user prompts, limiting autonomy. We propose an LLM-guided agentic object detection (LAOD) framework that enables fully label-free, zero-shot detection by prompting a Large Language Model (LLM) to generate scene-specific object names. These are passed to an open-vocabulary detector for localization, allowing the system to adapt its goals dynamically. We introduce two new metrics, Class-Agnostic Average Precision (CAAP) and Semantic Naming Average Precision (SNAP), to separately evaluate localization and naming. Experiments on LVIS, COCO, and COCO-OOD validate our approach, showing strong performance in detecting and naming novel objects. Our method offers enhanced autonomy and adaptability for open-world understanding.
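To make the idea of evaluating localization separately from naming concrete, the sketch below computes a class-agnostic average precision for one image at a single IoU threshold, ignoring predicted labels entirely. It is a simplified illustration, not the paper's exact CAAP definition: the greedy matching, the 0.5 threshold, and the uninterpolated AP formula are all assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def class_agnostic_ap(
    preds: List[Tuple[Box, float]],  # (box, confidence); labels are deliberately absent
    gts: List[Box],
    iou_thr: float = 0.5,            # assumed threshold
) -> float:
    """Uninterpolated AP over boxes only, with greedy one-to-one matching."""
    if not gts:
        return 0.0
    matched = set()
    hits: List[bool] = []
    # Visit predictions from most to least confident.
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            overlap = iou(box, gt)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            matched.add(best_j)
            hits.append(True)
        else:
            hits.append(False)
    # Sum precision at each true-positive rank, normalized by ground-truth count.
    true_positives, total = 0, 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            true_positives += 1
            total += true_positives / rank
    return total / len(gts)


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.8), ((21, 20, 30, 30), 0.7)]
ap = class_agnostic_ap(preds, gts)
print(round(ap, 4))
```

A naming-aware counterpart would additionally require the predicted name to agree with the ground-truth category before counting a match; how that agreement is judged is specified in the paper and is not reproduced here.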

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
