Unsupervised Domain Adaptation within Deep Foundation Latent Spaces
This work addresses domain adaptation for vision tasks, but its contribution is incremental, since it builds on existing foundation models and prototypical networks.
The paper tackles unsupervised domain adaptation using vision transformer-based foundation models without fine-tuning, showing that the proposed method improves upon existing baselines while highlighting its limitations.
Vision transformer-based foundation models, such as ViT or DINOv2, are designed to solve problems with little or no fine-tuning of features. Using a prototypical-network setting, we analyse to what extent such foundation models can solve unsupervised domain adaptation without fine-tuning on either the source or the target domain. Through quantitative analysis, as well as qualitative interpretation of the decision making, we demonstrate that the suggested method can improve upon existing baselines, and we showcase the limitations of such an approach that remain to be solved.
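To make the prototypical-network setting concrete, the following is a minimal sketch and not the paper's implementation. It assumes that features come from a frozen backbone (replaced here by synthetic 2-D vectors), that each class prototype is the mean of the labelled source features for that class, and that target samples are assigned to the nearest prototype by squared Euclidean distance. The function names, the distance choice, and the toy data are all illustrative assumptions.

```python
import random


def class_prototypes(features, labels, num_classes):
    """Return one prototype per class: the mean of that class's source features.

    Assumes every class index in range(num_classes) appears at least once.
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for vec, c in zip(features, labels):
        counts[c] += 1
        for j, v in enumerate(vec):
            sums[c][j] += v
    return [[s / counts[c] for s in sums[c]] for c in range(num_classes)]


def nearest_prototype(vec, protos):
    """Index of the prototype with the smallest squared Euclidean distance."""
    dists = [sum((v - p) ** 2 for v, p in zip(vec, proto)) for proto in protos]
    return dists.index(min(dists))


# Synthetic stand-in for frozen foundation-model embeddings (hypothetical data).
rng = random.Random(0)
centres = [(5.0, 0.0), (0.0, 5.0), (-5.0, -5.0)]


def sample(label, shift):
    # A constant offset plays the role of the domain shift.
    return [c + shift + rng.gauss(0.0, 0.1) for c in centres[label]]


source_labels = [c for c in range(3) for _ in range(20)]
source_feats = [sample(c, 0.0) for c in source_labels]
target_labels = [c for c in range(3) for _ in range(10)]  # used for evaluation only
target_feats = [sample(c, 0.5) for c in target_labels]

# Prototypes are built from the source domain; no feature is fine-tuned.
protos = class_prototypes(source_feats, source_labels, 3)
target_pred = [nearest_prototype(v, protos) for v in target_feats]
accuracy = sum(p == t for p, t in zip(target_pred, target_labels)) / len(target_labels)
print(f"target accuracy: {accuracy:.2f}")
```

In this sketch the target labels are never used to build the prototypes, which is what makes the setting unsupervised with respect to the target domain. With real data, the synthetic vectors would be replaced by embeddings extracted once from the frozen model.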