LanDA: Language-Guided Multi-Source Domain Adaptation
This work addresses a practical challenge for computer vision researchers and practitioners by enabling domain adaptation without target images, though it is incremental in that it builds on existing multimodal models and optimal transfer theory.
The paper tackles multi-source domain adaptation without target domain images, using only textual descriptions of the target domain, and demonstrates that the proposed method outperforms standard fine-tuning and ensemble approaches across various benchmarks.
Multi-Source Domain Adaptation (MSDA) aims to mitigate changes in data distribution when transferring knowledge from multiple labeled source domains to an unlabeled target domain. However, existing MSDA techniques assume target domain images are available, yet overlook image-rich semantic information. Consequently, an open question is whether MSDA can be guided solely by textual cues in the absence of target domain images. Employing a multimodal model with a joint image and language embedding space, we propose LanDA, a novel language-guided MSDA approach based on optimal transfer theory. LanDA facilitates the transfer of multiple source domains to a new target domain using only a textual description of that domain, without needing even a single target domain image, while retaining task-relevant information. We present extensive experiments across different transfer scenarios using a suite of relevant benchmarks, demonstrating that LanDA outperforms standard fine-tuning and ensemble approaches in both target and source domains.