CV · AI · Aug 24, 2025

DinoTwins: Combining DINO and Barlow Twins for Robust, Label-Efficient Vision Transformers

arXiv:2508.17509v1
Originality: Incremental advance
AI Analysis

The paper offers a scalable, label-efficient method for training vision transformers in resource-constrained environments. The advance is incremental, since it combines two existing self-supervised learning approaches rather than introducing a new one.

The paper tackles the problem of training vision transformers with limited labeled data by combining the DINO and Barlow Twins techniques. On MS COCO, with only 10% of the labeled data used for linear probing, the hybrid model reaches loss and classification accuracy comparable to DINO.

Training AI models to understand images without costly labeled data remains a challenge. We combine two techniques--DINO (teacher-student learning) and Barlow Twins (redundancy reduction)--to create a model that learns better with fewer labels and less compute. While both DINO and Barlow Twins have independently demonstrated strong performance in self-supervised learning, each comes with limitations--DINO may be sensitive to certain augmentations, and Barlow Twins often requires batch sizes too large to fit on consumer hardware. By combining the redundancy-reduction objective of Barlow Twins with the self-distillation strategy of DINO, we aim to leverage their complementary strengths. We train a hybrid model on the MS COCO dataset using only 10% of labeled data for linear probing, and evaluate its performance against standalone DINO and Barlow Twins implementations. Preliminary results show that the combined approach achieves comparable loss and classification accuracy to DINO while maintaining strong feature representations. Attention visualizations further suggest improved semantic segmentation capability in the hybrid model. This combined method offers a scalable, label-efficient alternative for training ViTs in resource-constrained environments.
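The abstract does not say how the two objectives are combined, so the following is only a minimal NumPy sketch of one plausible reading: a DINO-style self-distillation cross-entropy plus a weighted Barlow Twins redundancy-reduction term. The function names, the `bt_weight` mixing coefficient, and the default temperatures and off-diagonal weight are assumptions for illustration (the defaults are values commonly used with the original DINO and Barlow Twins methods), not settings confirmed by this paper.

```python
import numpy as np


def _log_softmax(x):
    # Numerically stable log-softmax over the last (class/prototype) axis.
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))


def dino_loss(student_logits, teacher_logits, center,
              t_student=0.1, t_teacher=0.04):
    # Self-distillation: cross-entropy between the centered, sharpened
    # teacher distribution and the student distribution.
    teacher_probs = np.exp(_log_softmax((teacher_logits - center) / t_teacher))
    student_logp = _log_softmax(student_logits / t_student)
    return float(-(teacher_probs * student_logp).sum(axis=1).mean())


def barlow_twins_loss(z1, z2, lambda_offdiag=5e-3, eps=1e-8):
    # Redundancy reduction: push the cross-correlation matrix of the two
    # views' embeddings toward the identity matrix.
    n = z1.shape[0]
    z1 = (z1 - z1.mean(axis=0)) / (z1.std(axis=0) + eps)
    z2 = (z2 - z2.mean(axis=0)) / (z2.std(axis=0) + eps)
    c = z1.T @ z2 / n                      # (dim, dim) cross-correlation
    diag = np.diag(c)
    on_diag = ((1.0 - diag) ** 2).sum()    # invariance term
    off_diag = (c ** 2).sum() - (diag ** 2).sum()  # decorrelation term
    return float(on_diag + lambda_offdiag * off_diag)


def hybrid_loss(student_logits, teacher_logits, center, z1, z2, bt_weight=1.0):
    # Hypothetical combination: a simple weighted sum of the two objectives.
    return (dino_loss(student_logits, teacher_logits, center)
            + bt_weight * barlow_twins_loss(z1, z2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, n_prototypes, dim = 8, 16, 4
    s = rng.normal(size=(batch, n_prototypes))   # student head outputs
    t = rng.normal(size=(batch, n_prototypes))   # teacher head outputs
    z1 = rng.normal(size=(batch, dim))           # embeddings of view 1
    z2 = z1 + 0.1 * rng.normal(size=(batch, dim))  # embeddings of view 2
    print(hybrid_loss(s, t, np.zeros(n_prototypes), z1, z2))
```

In a real training loop the teacher would be an exponential moving average of the student and gradients would flow only through the student branch; both are omitted here to keep the sketch focused on the loss terms.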

