Foundry: Distilling 3D Foundation Models for the Edge
This work addresses the challenge of deploying 3D foundation models on edge devices such as robots and AR/VR headsets. It offers a novel compression method that retains generality, though the contribution is incremental in that it applies distillation to 3D models.
The paper tackles the problem of deploying large 3D foundation models on edge devices by introducing Foundation Model Distillation (FMD) to compress them into efficient proxies, achieving strong performance across tasks like classification and segmentation while using significantly fewer tokens and FLOPs.
Foundation models pre-trained with self-supervised learning (SSL) on large-scale datasets have become powerful general-purpose feature extractors. However, their immense size and computational cost make them prohibitive for deployment on edge devices such as robots and AR/VR headsets. Existing compression techniques like standard knowledge distillation create efficient 'specialist' models but sacrifice the crucial, downstream-agnostic generality that makes foundation models so valuable. In this paper, we introduce Foundation Model Distillation (FMD), a new paradigm for compressing large SSL models into compact, efficient, and faithful proxies that retain their general-purpose representational power. We present Foundry, the first implementation of FMD for 3D point clouds. Foundry trains a student to learn a compressed set of SuperTokens that reconstruct the teacher's token-level representations, capturing a compact basis of its latent space. A single distilled model maintains strong transferability across diverse downstream tasks (classification, part segmentation, and few-shot scenarios), approaching full foundation-model performance while using significantly fewer tokens and FLOPs, making such models more practical for deployment on resource-constrained hardware.
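The SuperToken idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the cross-attention pooling in `compress`, the attention-based `reconstruct` step, the mean-squared-error objective, and all names and shapes (64 tokens, 8 SuperTokens, 32 feature dimensions) are assumptions chosen only to show how a small set of tokens might be scored on how well it rebuilds a teacher's per-token features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress(student_tokens, queries):
    # Hypothetical pooling: K query vectors attend over N student tokens,
    # producing K SuperTokens of the same feature dimension.
    d = student_tokens.shape[-1]
    attn = softmax(queries @ student_tokens.T / np.sqrt(d))  # (K, N)
    return attn @ student_tokens                             # (K, d)

def reconstruct(super_tokens, student_tokens):
    # Hypothetical decoder: each of the N token positions attends over the
    # K SuperTokens, so every output is a convex combination of them.
    d = super_tokens.shape[-1]
    attn = softmax(student_tokens @ super_tokens.T / np.sqrt(d))  # (N, K)
    return attn @ super_tokens                                    # (N, d)

def distillation_loss(student_tokens, teacher_tokens, queries):
    # Assumed objective: MSE between reconstructed and teacher token features.
    super_tokens = compress(student_tokens, queries)
    recon = reconstruct(super_tokens, student_tokens)
    return float(np.mean((recon - teacher_tokens) ** 2))

rng = np.random.default_rng(0)
teacher_tokens = rng.normal(size=(64, 32))  # stand-in for frozen teacher features
student_tokens = rng.normal(size=(64, 32))  # stand-in for student features
queries = rng.normal(size=(8, 32))          # stand-in for learnable queries

super_tokens = compress(student_tokens, queries)
recon = reconstruct(super_tokens, student_tokens)
loss = distillation_loss(student_tokens, teacher_tokens, queries)
print(super_tokens.shape, recon.shape)
```

In a real training loop the student features and the queries would be learned by gradient descent on this loss, and downstream heads would operate on the 8 SuperTokens rather than all 64 tokens, which is where a saving in tokens and FLOPs would come from under these assumptions.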