CV, Oct 1, 2025

VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors

arXiv:2510.00458v1, 3 citations, h-index: 19, Has Code
Originality: Incremental advance
AI Analysis

This work addresses domain adaptation challenges for vision-language object detection, offering incremental improvements for applications like autonomous driving and robotics.

The paper tackles the performance degradation of vision-language object detectors under domain shift by introducing VLOD-TTA, a test-time adaptation framework. It reports consistent gains in detection accuracy over zero-shot and test-time adaptation baselines across diverse distribution shifts, such as stylized domains and low-light conditions.

Vision-language object detectors (VLODs) such as YOLO-World and Grounding DINO achieve impressive zero-shot recognition by aligning region proposals with text representations. However, their performance often degrades under domain shift. We introduce VLOD-TTA, a test-time adaptation (TTA) framework for VLODs that leverages dense proposal overlap and image-conditioned prompt scores. First, we propose an IoU-weighted entropy objective that concentrates adaptation on spatially coherent proposal clusters and reduces confirmation bias from isolated boxes. Second, we introduce image-conditioned prompt selection, which ranks prompts by image-level compatibility and fuses the most informative prompts with the detector logits. Our benchmarking across diverse distribution shifts (stylized domains, driving scenes, low-light conditions, and common corruptions) shows the effectiveness of our method on two state-of-the-art VLODs, YOLO-World and Grounding DINO, with consistent improvements over the zero-shot and TTA baselines. Code: https://github.com/imatif17/VLOD-TTA
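
The abstract names the two components but does not give their formulas, so the following is only a minimal sketch of how they might look, not the authors' implementation. It assumes each proposal's weight is its mean IoU with the other proposals, that the loss is the weight-normalized average of per-proposal softmax entropies, and that prompts are fused by averaging the logits of the top-k prompts and blending them with the detector logits. The function names and the parameters `k` and `alpha` are hypothetical.

```python
import math

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def iou_weighted_entropy(boxes, logits):
    """Entropy averaged over proposals, weighted by overlap with other proposals.

    Assumption: weight_i = mean IoU of box i with every other box, so an
    isolated box (no overlap) contributes nothing to the objective.
    """
    n = len(boxes)
    weights = []
    for i in range(n):
        others = [iou(boxes[i], boxes[j]) for j in range(n) if j != i]
        weights.append(sum(others) / len(others) if others else 0.0)
    total = sum(weights)
    if total == 0:
        return 0.0
    ents = [entropy(softmax(row)) for row in logits]
    return sum(w * h for w, h in zip(weights, ents)) / total

def select_and_fuse(prompt_scores, prompt_logits, detector_logits, k=2, alpha=0.5):
    """Keep the k prompts with the highest image-level score and blend them in.

    Assumption: fusion is a convex combination of the detector logits and the
    mean logits of the selected prompts (alpha is a hypothetical mixing weight).
    """
    order = sorted(range(len(prompt_scores)), key=lambda i: prompt_scores[i], reverse=True)
    top = order[:k]
    n_classes = len(detector_logits)
    mean_prompt = [sum(prompt_logits[i][c] for i in top) / len(top) for c in range(n_classes)]
    return [alpha * d + (1 - alpha) * p for d, p in zip(detector_logits, mean_prompt)]

# Two overlapping boxes with confident predictions and one isolated, uncertain box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
logits = [[2.0, 0.0], [2.0, 0.0], [0.0, 0.0]]
loss = iou_weighted_entropy(boxes, logits)
fused = select_and_fuse([0.1, 0.9, 0.5], [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]], [1.0, 1.0])
```

In this toy example the isolated box has zero overlap with the others, so its maximally uncertain prediction does not enter the loss. This is the behavior the abstract describes as reducing confirmation bias from isolated boxes. In an actual TTA loop the loss would be computed on differentiable tensors and minimized with respect to the adapted parameters.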

