Reward-Guided Semantic Evolution for Test-time Adaptive Object Detection
For practitioners deploying VLM-based object detectors, RGSE provides a training-free method to maintain accuracy under distribution shifts, though it is an incremental improvement over existing test-time adaptation techniques.
RGSE addresses test-time distribution shifts in open-vocabulary object detection by refining text embeddings via evolutionary search without backpropagation, achieving state-of-the-art performance across multiple benchmarks with minimal overhead.
Open-vocabulary object detection with vision-language models (VLMs) such as Grounding DINO suffers from performance degradation under test-time distribution shifts, primarily due to semantic misalignment between text embeddings and the shifted visual embeddings of region proposals. Recent test-time adaptive object detection methods for VLM-based detectors either rely on costly backpropagation or bypass semantic misalignment via external memory; none directly and efficiently aligns text and vision in a training-free manner. To address this, we propose Reward-Guided Semantic Evolution (RGSE), a training-free framework that directly refines the text embeddings at test time. Inspired by evolutionary search, RGSE treats text embedding adaptation as a semantic search process: it perturbs the text embeddings to generate candidate variants, scores each candidate by its cosine similarity with current and historical high-confidence visual proposals as a reward signal, and fuses the candidates into a refined embedding through reward-weighted averaging. Without any backpropagation, RGSE achieves state-of-the-art performance across multiple detection benchmarks while adding minimal computational overhead. Our code will be open-sourced upon publication.
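The perturb, score, and fuse loop described above can be illustrated with a minimal NumPy sketch for a single class embedding. This is not the released implementation: the function name `refine_text_embedding`, the Gaussian perturbation, the mean-cosine reward, the softmax weighting, and the values of `num_candidates`, `sigma`, and `temperature` are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def refine_text_embedding(text_emb, visual_embs, num_candidates=16,
                          sigma=0.05, temperature=0.1, rng=None):
    """Illustrative gradient-free refinement of one text embedding.

    text_emb:    (d,) text embedding for a single class.
    visual_embs: (n, d) embeddings of high-confidence proposals
                 (current and historical ones can be stacked here).
    All hyperparameters are hypothetical, not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    text_emb = l2_normalize(text_emb)
    visual_embs = l2_normalize(visual_embs)

    # 1) Perturb: Gaussian noise around the current embedding (assumed
    #    noise model); the unperturbed embedding is kept as candidate 0.
    noise = rng.normal(0.0, sigma, size=(num_candidates, text_emb.shape[0]))
    candidates = l2_normalize(
        np.vstack([text_emb[None, :], text_emb[None, :] + noise])
    )

    # 2) Reward: mean cosine similarity of each candidate to the proposals.
    rewards = (candidates @ visual_embs.T).mean(axis=1)

    # 3) Fuse: reward-weighted average, here with softmax weights (assumed).
    logits = (rewards - rewards.max()) / temperature
    weights = np.exp(logits)
    weights /= weights.sum()
    refined = l2_normalize(weights @ candidates)
    return refined, rewards, weights
```

In this sketch, the historical proposals mentioned in the abstract would enter simply as additional rows of `visual_embs`. No gradients are computed at any step, which is what makes the procedure training-free.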