CV · AI · MM · Mar 12, 2024

MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric

arXiv:2403.07839v1 · 43 citations · h-index: 7 · CVPR
Originality: Highly original
AI Analysis

This paper addresses the computational bottleneck of deploying large vision-language models on edge devices and represents a strong incremental improvement in model compression techniques.

The paper tackles the problem of compressing large vision-language models like CLIP for resource-constrained platforms by proposing MoPE-CLIP, a structured pruning method using a novel Module-wise Pruning Error metric. It achieves state-of-the-art compression results, outperforming previous methods while maintaining strong zero-shot capabilities and competitive task-specific performance.

Vision-language pre-trained models have achieved impressive performance on various downstream tasks. However, their large model sizes hinder their utilization on platforms with limited computational resources. We find that directly using smaller pre-trained models and applying magnitude-based pruning on CLIP models leads to inflexibility and inferior performance. Recent efforts for VLP compression either adopt uni-modal compression metrics resulting in limited performance or involve costly mask-search processes with learnable masks. In this paper, we first propose the Module-wise Pruning Error (MoPE) metric, accurately assessing CLIP module importance by performance decline on cross-modal tasks. Using the MoPE metric, we introduce a unified pruning framework applicable to both pre-training and task-specific fine-tuning compression stages. For pre-training, MoPE-CLIP effectively leverages knowledge from the teacher model, significantly reducing pre-training costs while maintaining strong zero-shot capabilities. For fine-tuning, consecutive pruning from width to depth yields highly competitive task-specific models. Extensive experiments in two stages demonstrate the effectiveness of the MoPE metric, and MoPE-CLIP outperforms previous state-of-the-art VLP compression methods.
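The abstract describes MoPE as scoring a module by the performance decline it causes on cross-modal tasks. A minimal sketch of that idea follows, assuming a module's importance is the baseline score minus the score obtained with that module removed. The `evaluate` callback, the module names, and the toy contribution numbers are all hypothetical placeholders, not the authors' implementation or data; in practice `evaluate` would run a real zero-shot or task-specific evaluation of the pruned model.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List


def module_pruning_error(
    modules: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Score each module by the performance drop seen when it is removed.

    `evaluate` is an assumed callback: it receives the set of modules that
    remain active and returns a score where higher is better.
    """
    all_modules = frozenset(modules)
    baseline = evaluate(all_modules)
    # A larger drop means the module matters more and should be kept.
    return {m: baseline - evaluate(all_modules - {m}) for m in all_modules}


def select_modules_to_prune(errors: Dict[str, float], num_to_prune: int) -> List[str]:
    """Return the modules with the smallest pruning error, least important first."""
    ranked = sorted(errors, key=lambda name: errors[name])
    return ranked[:num_to_prune]


# Hypothetical per-module contributions, used only to make the sketch runnable.
TOY_CONTRIBUTION = {"head_0": 0.02, "head_1": 0.15, "ffn_0": 0.30, "layer_3": 0.05}


def toy_evaluate(active: FrozenSet[str]) -> float:
    """Stand-in for a cross-modal evaluation such as retrieval recall."""
    return 0.40 + sum(TOY_CONTRIBUTION[m] for m in active)


if __name__ == "__main__":
    errors = module_pruning_error(TOY_CONTRIBUTION, toy_evaluate)
    print(select_modules_to_prune(errors, num_to_prune=2))
```

Each module costs one extra evaluation on top of the baseline, so the cost of scoring grows linearly with the number of candidate modules. This toy version treats modules as independent and additive, which a real network will not satisfy.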

