CV | Aug 20, 2025

Seeing Further on the Shoulders of Giants: Knowledge Inheritance for Vision Foundation Models

arXiv:2508.14707v2 | 1 citation | h-index: 15 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the resource constraints facing institutions that lack large-scale data and high-end GPUs by enabling more efficient development of vision foundation models (VFMs). The advance is incremental, however, because it builds on existing pre-trained models.

The paper tackles the bottleneck of training vision foundation models (VFMs) by proposing a model-driven approach that unifies multiple pre-trained teacher models to transfer knowledge without requiring large labeled datasets, resulting in a VFM that outperforms data-centric models across four fundamental vision tasks.

Vision foundation models (VFMs) are predominantly developed using data-centric methods. These methods require training on vast amounts of data usually with high-quality labels, which poses a bottleneck for most institutions that lack both large-scale data and high-end GPUs. On the other hand, many open-source vision models have been pretrained on domain-specific data, enabling them to distill and represent core knowledge in a form that is transferable across diverse applications. Even though these models are highly valuable assets, they remain largely under-explored in empowering the development of a general-purpose VFM. In this paper, we present a new model-driven approach for training VFMs through joint knowledge transfer and preservation. Our method unifies multiple pre-trained teacher models in a shared latent space to mitigate the "imbalanced transfer" issue caused by their distributional gaps. Besides, we introduce a knowledge preservation strategy to take a general-purpose teacher as a knowledge base for integrating knowledge from the remaining purpose-specific teachers using an adapter module. By unifying and aggregating existing models, we build a powerful VFM to inherit teachers' expertise without needing to train on a large amount of labeled data. Our model not only provides generalizable visual features, but also inherently supports multiple downstream tasks. Extensive experiments demonstrate that our VFM outperforms existing data-centric models across four fundamental vision tasks, including image classification, object detection, semantic and instance segmentation.
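
The "imbalanced transfer" issue mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's method: it assumes, purely for illustration, that the distributional gap between teachers shows up as a difference in feature scale, and it swaps in simple per-teacher standardization as a stand-in for the shared latent space. The teacher names, feature values, and helper functions are all hypothetical.

```python
import statistics


def standardize(features):
    """Rescale a flat list of values to zero mean and unit variance."""
    mean = statistics.fmean(features)
    std = statistics.pstdev(features)
    if std == 0:
        return [0.0 for _ in features]
    return [(x - mean) / std for x in features]


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_losses(student_outputs, teacher_features, normalize):
    """Per-teacher MSE between the student's outputs and teacher targets.

    With normalize=True, each teacher's targets are standardized first,
    so every teacher contributes on a comparable scale (an assumption
    made for this sketch, not the paper's actual unification step).
    """
    losses = {}
    for name, target in teacher_features.items():
        if normalize:
            target = standardize(target)
        losses[name] = mse(student_outputs[name], target)
    return losses


# Made-up features from two hypothetical teachers with very different scales.
teacher_features = {
    "general": [0.1, -0.2, 0.3, 0.0],
    "detection": [50.0, -80.0, 120.0, 10.0],
}

# An untrained student whose per-teacher heads all output zeros.
student_outputs = {name: [0.0] * 4 for name in teacher_features}

raw = distillation_losses(student_outputs, teacher_features, normalize=False)
norm = distillation_losses(student_outputs, teacher_features, normalize=True)

print("raw losses:       ", raw)
print("normalized losses:", norm)
```

In the raw case the large-magnitude teacher dominates the summed loss, so gradient updates would mostly serve that teacher. After standardization both teachers contribute losses of similar size, which is the balancing effect a shared target space is meant to provide.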

