CV, AI, LG · Jul 15, 2024

Accessing Vision Foundation Models via ImageNet-1K

arXiv:2407.10366v2 · 11 citations · h-index: 13
Originality: Incremental advance
AI Analysis

This work addresses the challenge of making vision foundation models more accessible for broader research by reducing training costs and data requirements, though it is incremental as it builds on existing distillation techniques.

The paper tackles the high resource demands and inaccessible training data of vision foundation models by proposing Proteus, a method that distills these models into smaller equivalents using only the 1.2M images of ImageNet-1K. With DINOv2-g/14 as the teacher, Proteus-L/14 matches DINOv2-L/14 across 19 benchmarks and outperforms CLIP-L/14, even though those models were trained on far larger datasets.

Vision foundation models are renowned for their generalization ability, which stems from massive training data. Nevertheless, they demand tremendous training resources, and their training data is often inaccessible (e.g., for CLIP and DINOv2), which poses great challenges to developing derivatives that could facilitate research. In this work, we offer a very simple and general solution, named Proteus, to distill foundation models into smaller equivalents on ImageNet-1K without access to the original training data. Specifically, we remove the designs from conventional knowledge distillation settings that result in dataset bias and present three levels of training objectives, i.e., token, patch, and feature, to maximize the efficacy of knowledge transfer. In this manner, Proteus is trained at ImageNet-level costs with surprising ability, making the training of foundation models more accessible to the broader research community. When leveraging DINOv2-g/14 as the teacher, Proteus-L/14 matches the performance of the Oracle method DINOv2-L/14 (142M training data) across 19 benchmarks and outperforms other vision foundation models, including CLIP-L/14 (400M), OpenCLIP-L/14 (400M/2B), and SynCLR-L/14 (600M), with a significantly smaller training set of 1.2M images.
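
The abstract names three levels of training objectives (token, patch, and feature) but does not spell out their form. The following is a minimal sketch of what such a multi-level feature-alignment loss could look like, not the authors' implementation. It assumes that "token" refers to the class token, "patch" to the patch tokens, and "feature" to the full token sequence, that each level has its own linear projection head from student width to teacher width, and that mean squared error is the distance. The function name `three_level_loss`, the `heads` dictionary, and all tensor shapes are hypothetical.

```python
import numpy as np


def mse(a, b):
    """Mean squared error over all elements."""
    return float(np.mean((a - b) ** 2))


def three_level_loss(student_tokens, teacher_tokens, heads):
    """Sum of token-, patch-, and feature-level alignment losses (illustrative).

    student_tokens: (batch, 1 + num_patches, d_student), class token at index 0
    teacher_tokens: (batch, 1 + num_patches, d_teacher), same layout
    heads: dict with keys "token", "patch", "feature"; each value is a
           (d_student, d_teacher) projection matrix (hypothetical heads)
    """
    s_cls, s_patch = student_tokens[:, 0], student_tokens[:, 1:]
    t_cls, t_patch = teacher_tokens[:, 0], teacher_tokens[:, 1:]
    losses = {
        # Assumption: token level = class token only.
        "token": mse(s_cls @ heads["token"], t_cls),
        # Assumption: patch level = patch tokens only.
        "patch": mse(s_patch @ heads["patch"], t_patch),
        # Assumption: feature level = the whole token sequence.
        "feature": mse(student_tokens @ heads["feature"], teacher_tokens),
    }
    return sum(losses.values()), losses


# Toy example with random arrays standing in for real model outputs.
rng = np.random.default_rng(0)
batch, num_patches, d_student, d_teacher = 2, 4, 8, 16
student = rng.normal(size=(batch, 1 + num_patches, d_student))
teacher = rng.normal(size=(batch, 1 + num_patches, d_teacher))
heads = {
    name: 0.1 * rng.normal(size=(d_student, d_teacher))
    for name in ("token", "patch", "feature")
}
total, parts = three_level_loss(student, teacher, heads)
print(total, parts)
```

Note that the sketch uses no class labels: the only supervision is the teacher's output, which is in the spirit of removing label-dependent designs that the abstract associates with dataset bias. In a real training loop the teacher would be frozen and the heads would be learned jointly with the student.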

Code Implementations: 1 repo
