LG, AI | Apr 9, 2024

CLIP-Embed-KD: Computationally Efficient Knowledge Distillation Using Embeddings as Teachers

arXiv:2404.06170v1 | 4.61 citations | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the high computational cost of knowledge distillation for practitioners, though the advance is incremental because it builds on existing CLIP and distillation methods.

The paper tackles the computational inefficiency of knowledge distillation with large teacher models by using CLIP embeddings as teachers, reporting preliminary results that can outperform full-scale distillation while using 9x less memory and 8x less training time.

Contrastive Language-Image Pre-training (CLIP) has been shown to improve zero-shot generalization capabilities of language and vision models. In this paper, we extend CLIP for efficient knowledge distillation, by utilizing embeddings as teachers. Typical knowledge distillation frameworks require running forward passes through a teacher model, which is often prohibitive in the case of billion or trillion parameter teachers. In these cases, using only the embeddings of the teacher models to guide the distillation can yield significant computational savings. Our preliminary findings show that CLIP-based knowledge distillation with embeddings can outperform full scale knowledge distillation using $9\times$ less memory and $8\times$ less training time. Code available at: https://github.com/lnairGT/CLIP-Distillation/
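The abstract does not spell out the training objective, so the following is only a minimal sketch of the general idea of using embeddings as teachers, not the authors' implementation. It assumes that one teacher embedding per class has been computed ahead of time and cached, and that the student is trained to align its own embedding with the cached embedding of the true label through a cosine-distance loss. The names `teacher_embeddings`, `cosine_similarity`, and `embedding_distillation_loss`, and the toy 3-dimensional vectors, are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embedding_distillation_loss(student_embedding, label, teacher_embeddings):
    """Cosine distance between the student embedding and the cached
    teacher embedding for the given label (assumed loss, for illustration).

    No teacher forward pass happens here: the teacher contributes only
    through the precomputed lookup table.
    """
    target = teacher_embeddings[label]
    return 1.0 - cosine_similarity(student_embedding, target)


# Hypothetical cached teacher embeddings, one per class, computed once offline.
teacher_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
}

# A student embedding pointing in the teacher's direction incurs no loss;
# one orthogonal to it incurs the maximum loss for non-negative similarity.
aligned = embedding_distillation_loss([2.0, 0.0, 0.0], "cat", teacher_embeddings)
misaligned = embedding_distillation_loss([0.0, 3.0, 0.0], "cat", teacher_embeddings)
```

Under these assumptions, the per-step cost depends only on the student and a table lookup, which is the kind of saving the abstract attributes to skipping teacher forward passes.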

