CVApr 3, 2023

AutoLabel: CLIP-based framework for Open-set Video Domain Adaptation

arXiv:2304.01110v227 citationsh-index: 55Has Code
AI Analysis

This addresses the problem of adapting action recognition models to unlabelled target domains with unknown categories for video analysis, representing an incremental improvement by leveraging pre-trained models.

The paper tackles open-set unsupervised video domain adaptation by using CLIP to reject target-private categories, achieving satisfactory rejection and better alignment between shared classes without needing oracle knowledge of label names.

Open-set Unsupervised Video Domain Adaptation (OUVDA) deals with the task of adapting an action recognition model from a labelled source domain to an unlabelled target domain that contains "target-private" categories, which are present in the target but absent in the source. In this work we deviate from the prior work of training a specialized open-set classifier or weighted adversarial learning by proposing to use pre-trained Language and Vision Models (CLIP). The CLIP is well suited for OUVDA due to its rich representation and the zero-shot recognition capabilities. However, rejecting target-private instances with the CLIP's zero-shot protocol requires oracle knowledge about the target-private label names. To circumvent the impossibility of the knowledge of label names, we propose AutoLabel that automatically discovers and generates object-centric compositional candidate target-private class names. Despite its simplicity, we show that CLIP when equipped with AutoLabel can satisfactorily reject the target-private instances, thereby facilitating better alignment between the shared classes of the two domains. The code is available.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes