CLIP the Gap: A Single Domain Generalization Approach for Object Detection
This work addresses a critical gap in domain generalization for object detection, a capability essential to real-world applications such as autonomous driving. The contribution is incremental, however, since it builds on pre-existing vision-language models.
The paper tackles single domain generalization for object detection, where a model is trained on one source domain and must generalize to unseen target domains. It reports a 10% improvement over the only existing method, Single-DGOD, on a diverse weather-driving benchmark.
Single Domain Generalization (SDG) tackles the problem of training a model on a single source domain so that it generalizes to any unseen target domain. While this has been well studied for image classification, the literature on SDG object detection remains almost non-existent. To address the challenges of simultaneously learning robust object localization and representation, we propose to leverage a pre-trained vision-language model to introduce semantic domain concepts via textual prompts. We achieve this via a semantic augmentation strategy acting on the features extracted by the detector backbone, as well as a text-based classification loss. Our experiments evidence the benefits of our approach, outperforming by 10% the only existing SDG object detection method, Single-DGOD [49], on their own diverse weather-driving benchmark.
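The two ingredients named in the abstract, a semantic augmentation applied to backbone features and a text-based classification loss, can be illustrated with a small NumPy sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: the abstract does not say how the augmentation is computed, so the sketch assumes it is an additive shift along the difference between two prompt embeddings. Random vectors stand in for the vision-language model's text embeddings and for the detector's region features, and every function name, the embedding size, and the temperature value are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def semantic_augment(features, source_text_emb, domain_text_emb):
    """Shift features along the direction between two prompt embeddings.

    Assumption: the augmentation is an additive offset. The abstract only
    says that it acts on features extracted by the detector backbone.
    """
    offset = l2_normalize(domain_text_emb) - l2_normalize(source_text_emb)
    return features + offset  # the offset broadcasts over all regions


def text_classification_loss(region_feats, class_text_embs, labels, temperature=0.07):
    """Cross-entropy over cosine similarities to class prompt embeddings."""
    f = l2_normalize(region_feats)
    t = l2_normalize(class_text_embs)
    logits = f @ t.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


# Stand-in data: random vectors replace real text embeddings and region features.
rng = np.random.default_rng(0)
dim, n_regions, n_classes = 16, 4, 3
region_feats = rng.normal(size=(n_regions, dim))
class_text_embs = rng.normal(size=(n_classes, dim))  # e.g. one prompt per object class
source_emb = rng.normal(size=dim)   # hypothetical prompt describing the source domain
domain_emb = rng.normal(size=dim)   # hypothetical prompt describing a target-like domain
labels = np.array([0, 2, 1, 0])

augmented = semantic_augment(region_feats, source_emb, domain_emb)
loss = text_classification_loss(augmented, class_text_embs, labels)
print(augmented.shape, loss)
```

In this sketch, using the same prompt embedding for both source and domain leaves the features unchanged, which is a quick sanity check that the shift depends only on the difference between the two prompts.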