CV · Mar 29, 2025

Large Self-Supervised Models Bridge the Gap in Domain Adaptive Object Detection

Marc-Antoine Lavoie, Anas Mahmoud, Steven L. Waslander

U of Toronto

arXiv:2503.23220v1 · 15.59 citations · h-index: 44 · Has Code · CVPR

Originality: Incremental advance

AI Analysis

This work addresses domain adaptation for object detection, which is crucial for deploying models in real-world settings where data distributions vary. It is a strong, targeted improvement rather than a foundational shift.

The paper tackles the problem of domain adaptive object detection by introducing DINO Teacher, which uses a large frozen DINOv2 backbone to generate more accurate target domain labels and align features, achieving state-of-the-art performance on multiple datasets.

The current state-of-the-art methods in domain adaptive object detection (DAOD) use Mean Teacher self-labelling, where a teacher model, derived directly as an exponential moving average of the student model, generates labels on the target domain that are then used to improve both models in a positive feedback loop. This couples learning and label generation on the target domain, and other recent works also leverage the generated labels to add domain alignment losses. We believe this coupling is brittle and excessively constrained: there is no guarantee that a student trained only on source data can generate accurate target domain labels and initiate the positive feedback loop, and much better target domain labels can likely be generated by a large pretrained network that has been exposed to much more data. Vision foundation models are exactly such models, and they have shown impressive task generalization capabilities even when frozen. We want to leverage these models for DAOD and introduce DINO Teacher, which consists of two components. First, we train a new labeller on source data only, using a large frozen DINOv2 backbone, and show that it generates more accurate labels than Mean Teacher. Next, we align the student's source and target image patch features with those from a DINO encoder, driving source and target representations closer to the generalizable DINO representation. We obtain state-of-the-art performance on multiple DAOD datasets. Code is available at https://github.com/TRAILab/DINO_Teacher
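The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not taken from the authors' repository: the function names, the dictionary-of-floats stand-in for model weights, and the choice of a cosine-distance alignment loss are all assumptions made for illustration. The first function shows the exponential moving average that ties a Mean Teacher to its student; the last shows one plausible way to pull student patch features toward those of a frozen reference encoder.

```python
import math


def ema_update(teacher, student, decay=0.999):
    """Mean Teacher update: teacher <- decay * teacher + (1 - decay) * student.

    Weights are toy name -> float dicts here (hypothetical stand-ins for tensors).
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero feature vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(student_patches, reference_patches):
    """Mean (1 - cosine similarity) over corresponding patch features.

    Assumed loss form; the paper may use a different distance or a projection head.
    """
    losses = [
        1.0 - cosine_similarity(s, r)
        for s, r in zip(student_patches, reference_patches)
    ]
    return sum(losses) / len(losses)


# Toy usage: one scalar "weight" and two 2-D patch features per image.
teacher = ema_update({"w": 1.0}, {"w": 0.0}, decay=0.9)
loss = alignment_loss([[1.0, 0.0], [0.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]])
```

In the toy usage the teacher weight moves only a tenth of the way toward the student, which illustrates the coupling the abstract describes: the teacher can never stray far from a student trained on source data alone. The alignment loss is zero there because cosine distance ignores feature magnitude; in this sketch the reference features would come from a frozen encoder and receive no gradient.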
