CV · Mar 29

Towards Domain-Generalized Open-Vocabulary Object Detection: A Progressive Domain-invariant Cross-modal Alignment Method

arXiv:2603.27556 · 61.5 · h-index: 24
Predicted impact top 55% in CV · last 90 days · Originality: Incremental advance
AI Analysis

For researchers in open-vocabulary detection, this work formalizes the domain generalization problem and provides a method to mitigate performance degradation under distribution shifts.

The paper argues that Open-Vocabulary Object Detection (OVOD) degrades under domain shift because cross-modal alignment collapses. It proposes Progressive Domain-invariant Cross-modal Alignment (PICA), which uses a curriculum and adaptive pseudo-word prototypes to enforce domain-invariant alignment, and reports improved robustness on a new DG-OVOD benchmark.

Open-Vocabulary Object Detection (OVOD) has achieved remarkable success in generalizing to novel categories. However, this success often rests on the implicit assumption of domain stationarity. In this work, we provide a principled revisit of the OVOD paradigm, uncovering a fundamental vulnerability: the fragile coupling between visual manifolds and textual embeddings when distribution shifts occur. We first systematically formalize Domain-Generalized Open-Vocabulary Object Detection (DG-OVOD). Through empirical analysis, we demonstrate that visual shifts do not merely add noise; they cause a collapse of the latent cross-modal space where novel category visual signals detach from their semantic anchors. Motivated by these insights, we propose Progressive Domain-invariant Cross-modal Alignment (PICA). PICA departs from uniform training by introducing a multi-level ambiguity and signal strength curriculum. It builds adaptive pseudo-word prototypes, refined via sample confidence and visual consistency, to enforce invariant cross-domain modality alignment. Our findings suggest that OVOD's robustness to domain shifts is intrinsically linked to the stability of the latent cross-modal alignment space. Our work provides both a challenging benchmark and a new perspective on building truly generalizable open-vocabulary systems that extend beyond static laboratory conditions.
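The abstract says the pseudo-word prototypes are "refined via sample confidence and visual consistency" but gives no update rule. The sketch below is one plausible reading, not the authors' implementation: each sample is weighted by its confidence times its (clipped) cosine similarity to the current prototype, and the prototype is moved toward the weighted mean with an exponential moving average. The function name `update_prototype`, the `momentum` parameter, the weighting formula, and the use of an EMA are all assumptions made for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm; return a copy unchanged if it is all zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def _cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    a, b = _normalize(a), _normalize(b)
    return sum(x * y for x, y in zip(a, b))


def update_prototype(prototype, features, confidences, momentum=0.9):
    """Hypothetical prototype refinement step (illustrative assumption only).

    prototype:   current class prototype, a list of floats
    features:    per-sample visual features, each the same length as prototype
    confidences: per-sample confidence scores in [0, 1]
    momentum:    EMA factor; higher values change the prototype more slowly
    """
    unit_feats = [_normalize(f) for f in features]
    # Weight = confidence x consistency, where consistency is the cosine
    # similarity to the current prototype, clipped at zero.
    weights = [
        c * max(0.0, _cosine(f, prototype))
        for f, c in zip(unit_feats, confidences)
    ]
    total = sum(weights)
    if total == 0.0:
        # No sample is both confident and consistent: keep the prototype.
        return list(prototype)
    dim = len(prototype)
    mean = [
        sum(w * f[i] for w, f in zip(weights, unit_feats)) / total
        for i in range(dim)
    ]
    # Exponential moving average toward the weighted mean, then re-normalize.
    blended = [momentum * p + (1.0 - momentum) * m for p, m in zip(prototype, mean)]
    return _normalize(blended)


if __name__ == "__main__":
    proto = [1.0, 0.0]
    feats = [[0.9, 0.1], [0.0, 1.0], [0.8, -0.2]]
    confs = [0.95, 0.40, 0.70]
    print(update_prototype(proto, feats, confs))
```

Under this weighting, a sample that is orthogonal or opposed to the current prototype contributes nothing regardless of its confidence, which is one simple way to keep noisy out-of-domain samples from dragging the prototype away from its semantic anchor.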

