CV AI LG | Oct 23, 2025

VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models

Jesimon Barreto, Carlos Caetano, André Araujo, William Robson Schwartz

arXiv:2510.20994v1 | h-index: 11 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses domain adaptation for vision models in settings where supervised fine-tuning is infeasible. Its self-supervised approach is an incremental advance over existing adaptation techniques.

The paper tackles the problem of adapting vision foundation models to new domains with distribution shifts and scarce labels. It introduces VESSA, a self-supervised fine-tuning method that uses short multi-view object-centric videos without annotations. VESSA shows consistent improvements in downstream classification tasks over both the base models and previous adaptation methods.

Foundation models have advanced computer vision by enabling strong performance across diverse tasks through large-scale pretraining and supervised fine-tuning. However, they may underperform in domains with distribution shifts and scarce labels, where supervised fine-tuning may be infeasible. While continued self-supervised learning for model adaptation is common for generative language models, this strategy has not proven effective for vision-centric encoder models. To address this challenge, we introduce a novel formulation of self-supervised fine-tuning for vision foundation models, where the model is adapted to a new domain without requiring annotations, leveraging only short multi-view object-centric videos. Our method is referred to as VESSA: Video-based objEct-centric Self-Supervised Adaptation for visual foundation models. VESSA's training technique is based on a self-distillation paradigm, where it is critical to carefully tune prediction heads and deploy parameter-efficient adaptation techniques; otherwise, the model may quickly forget its pretrained knowledge and reach a degraded state. VESSA benefits significantly from multi-view object observations sourced from different frames in an object-centric video, efficiently learning robustness to varied capture conditions without the need for annotations. Through comprehensive experiments with 3 vision foundation models on 2 datasets, VESSA demonstrates consistent improvements in downstream classification tasks, compared to the base models and previous adaptation methods. Code is publicly available at https://github.com/jesimonbarreto/VESSA.
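
The abstract describes self-distillation over views taken from different frames of one object-centric video. A minimal, framework-free sketch of that general idea follows. It is not the authors' implementation: the function names, the frame-gap sampling rule, the temperatures, and the EMA teacher update are all assumptions borrowed from common self-distillation setups, and the abstract specifies none of them.

```python
import math
import random


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - peak - log_total for x in scaled]


def sample_frame_pair(num_frames, max_gap, rng):
    """Pick two distinct frame indices from one video, at most max_gap apart.

    The two frames stand in for two "views" of the same object
    (hypothetical sampling rule, assumed for illustration).
    """
    if num_frames < 2 or max_gap < 1:
        raise ValueError("need at least 2 frames and max_gap >= 1")
    first = rng.randrange(num_frames)
    low = max(0, first - max_gap)
    high = min(num_frames - 1, first + max_gap)
    candidates = [i for i in range(low, high + 1) if i != first]
    return first, rng.choice(candidates)


def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy from the sharpened teacher distribution to the student.

    Temperatures are assumed values, not taken from the paper.
    """
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))


def ema_update(teacher_params, student_params, momentum=0.99):
    """Move teacher parameters toward the student (assumed EMA teacher)."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


if __name__ == "__main__":
    rng = random.Random(0)
    frame_a, frame_b = sample_frame_pair(num_frames=30, max_gap=5, rng=rng)
    # In a real pipeline the logits would come from the student and teacher
    # prediction heads applied to frame_a and frame_b; here they are made up.
    loss = distillation_loss([1.5, 0.2, -0.7], [2.0, 0.1, -1.0])
    print(frame_a, frame_b, loss)
```

In a real setup, the student would see one frame and the teacher the other, with only a small set of adapter parameters receiving gradients, which is one way to read the abstract's point about parameter-efficient adaptation.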
