CV, AI · Sep 12, 2025

Zero-Shot Referring Expression Comprehension via Vision-Language True/False Verification

arXiv:2509.09958v3
Originality: Highly original
AI Analysis

This work addresses zero-shot referring expression comprehension with general-purpose vision-language models, arguing that workflow design, rather than task-specific pretraining, drives performance.

The authors tackle referring expression comprehension without task-specific training by reformulating it as box-wise visual-language verification. On RefCOCO, RefCOCO+, and RefCOCOg, the method performs competitively with or better than trained baselines such as GroundingDINO.

Referring Expression Comprehension (REC) is usually addressed with task-trained grounding models. We show that a zero-shot workflow, without any REC-specific training, can achieve competitive or superior performance. Our approach reformulates REC as box-wise visual-language verification: given proposals from a COCO-clean generic detector (YOLO-World), a general-purpose VLM independently answers True/False queries for each region. This simple procedure reduces cross-box interference, supports abstention and multiple matches, and requires no fine-tuning. On RefCOCO, RefCOCO+, and RefCOCOg, our method not only surpasses a zero-shot GroundingDINO baseline but also exceeds reported results for GroundingDINO trained on REC and GroundingDINO+CRG. Controlled studies with identical proposals confirm that verification significantly outperforms selection-based prompting, and results hold with open VLMs. Overall, we show that workflow design, rather than task-specific pretraining, drives strong zero-shot REC performance.
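The box-wise verification loop described in the abstract can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the prompt wording, the `ask_vlm` callable (a stand-in for whatever VLM is queried), and the helper names `build_prompt`, `parse_true_false`, and `verify_boxes` are all hypothetical. Proposals are assumed to come from an upstream detector as `(x1, y1, x2, y2)` tuples.

```python
from typing import Any, Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format

# Hypothetical interface: takes an image, one box, and a prompt, returns the
# VLM's raw text answer. How the box is shown to the model (crop, overlay,
# coordinates in text) is left to the caller.
AskVLM = Callable[[Any, Box, str], str]


def build_prompt(expression: str) -> str:
    # Assumed prompt wording; the source does not give the exact prompt.
    return (
        f'Does the indicated region match the description "{expression}"? '
        "Answer True or False."
    )


def parse_true_false(answer: str) -> bool:
    # Treat anything that does not clearly start with "true" as False.
    return answer.strip().lower().startswith("true")


def verify_boxes(
    image: Any, boxes: List[Box], expression: str, ask_vlm: AskVLM
) -> List[Box]:
    """Query each proposal independently and keep the boxes judged True.

    An empty result corresponds to abstention; several boxes in the result
    correspond to multiple matches.
    """
    prompt = build_prompt(expression)
    matches = []
    for box in boxes:
        # One query per box, so the answer for one box never sees the others.
        if parse_true_false(ask_vlm(image, box, prompt)):
            matches.append(box)
    return matches
```

Because each proposal is judged in isolation, the function can return zero, one, or many boxes, in contrast to a selection-style prompt that shows all candidates at once and asks for a single choice.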
