RO AI CV LG | Jul 29, 2024

Theia: Distilling Diverse Vision Foundation Models for Robot Learning

Jinghuan Shang, Karl Schmeckpeper, Brandon B. May, Maria Vittoria Minniti, Tarik Kelestemur, David Watkins, Laura Herlant

arXiv:2407.20179v2 | 28.069 citations | h-index: 14 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for holistic visual understanding in robot learning. The advance is incremental, since it builds on existing vision foundation models.

The paper tackles vision-based robot policy learning by introducing Theia, a vision foundation model that distills multiple off-the-shelf vision foundation models to encode diverse visual knowledge. Theia outperforms its teacher models and prior robot learning models while using less training data and smaller model sizes.

Vision-based robot policy learning, which maps visual inputs to actions, necessitates a holistic understanding of diverse visual tasks beyond single-task needs like classification or segmentation. Inspired by this, we introduce Theia, a vision foundation model for robot learning that distills multiple off-the-shelf vision foundation models trained on varied vision tasks. Theia's rich visual representations encode diverse visual knowledge, enhancing downstream robot learning. Extensive experiments demonstrate that Theia outperforms its teacher models and prior robot learning models using less training data and smaller model sizes. Additionally, we quantify the quality of pre-trained visual representations and hypothesize that higher entropy in feature norm distributions leads to improved robot learning performance. Code, models, and demo are available at https://theia.theaiinstitute.com.
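The abstract's hypothesis about entropy in feature norm distributions can be made concrete with a small sketch. The abstract does not state how the entropy is computed, so everything below is an assumption for illustration: the per-token L2 norm, the histogram-based estimate, the bin count, and the function names `feature_norms` and `norm_entropy` are all hypothetical and are not taken from the Theia code.

```python
import math
from typing import List, Sequence


def feature_norms(tokens: Sequence[Sequence[float]]) -> List[float]:
    """Return the L2 norm of each token's feature vector."""
    return [math.sqrt(sum(x * x for x in tok)) for tok in tokens]


def norm_entropy(norms: Sequence[float], num_bins: int = 8) -> float:
    """Shannon entropy (in nats) of a histogram over the given norms.

    Assumption: equal-width bins spanning [min(norms), max(norms)].
    The paper may use a different estimator.
    """
    lo, hi = min(norms), max(norms)
    if hi == lo:
        # All norms identical: the distribution has a single outcome.
        return 0.0
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for n in norms:
        # Clamp so the maximum value falls into the last bin.
        idx = min(int((n - lo) / width), num_bins - 1)
        counts[idx] += 1
    total = len(norms)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Toy comparison: evenly spread norms versus norms concentrated at one value.
spread = [float(i) for i in range(8)]
peaked = [1.0] * 7 + [2.0]
print(norm_entropy(spread), norm_entropy(peaked))
```

Under this estimate, norms spread evenly across the bins give higher entropy than norms concentrated at one value. Whether higher entropy corresponds to better downstream robot learning is the hypothesis the paper investigates, not something this sketch demonstrates.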
