Universal Domain Adaptation from Foundation Models: A Baseline Study
This work addresses domain adaptation challenges for machine learning practitioners by highlighting the limitations of current UniDA methods with foundation models and offering a simple, effective baseline, though it is incremental as it builds on existing foundation model capabilities.
The paper tackles the problem of universal domain adaptation (UniDA) using foundation models like CLIP, finding that existing methods often fail to improve over baselines, and proposes CLIP distillation with self-calibration, which outperforms previous approaches in most benchmark tasks, excelling in metrics such as H-score/H³-score and universal classification rate (UCR).
Foundation models (e.g., CLIP or DINOv2) have shown their impressive learning and transfer capabilities in a wide range of visual tasks, by training on a large corpus of data and adapting to specific downstream tasks. It is, however, interesting that foundation models have not been fully explored for universal domain adaptation (UniDA), which is to learn models using labeled data in a source domain and unlabeled data in a target one, such that the learned models can successfully adapt to the target data. In this paper, we make comprehensive empirical studies of state-of-the-art UniDA methods using foundation models. We first observe that, unlike fine-tuning from ImageNet pre-trained models, as previous methods do, fine-tuning from foundation models yields significantly poorer results, sometimes even worse than training from scratch. While freezing the backbones, we demonstrate that although the foundation models greatly improve the performance of the baseline method that trains the models on the source data alone, existing UniDA methods generally fail to improve over the baseline. This suggests that new research efforts are very necessary for UniDA using foundation models. Based on these findings, we introduce \textit{CLIP distillation}, a parameter-free method specifically designed to distill target knowledge from CLIP models. The core of our \textit{CLIP distillation} lies in a self-calibration technique for automatic temperature scaling, a feature that significantly enhances the baseline's out-class detection capability. Although simple, our method outperforms previous approaches in most benchmark tasks, excelling in evaluation metrics including H-score/H$^3$-score and the newly proposed universal classification rate (UCR) metric. We hope that our investigation and the proposed simple framework can serve as a strong baseline to facilitate future studies in this field.