CV AIOct 23, 2025

TernaryCLIP: Efficiently Compressing Vision-Language Models with Ternary Weights and Distilled Knowledge

Shu-Hao Zhang, Wei-Cheng Tang, Chen Wu, Peng Hu, Nan Li, Liang-Jie Zhang, Qi Zhang, Shao-Qun Zhang

arXiv:2510.21879v12 citationsh-index: 1

Originality Incremental advance

AI Analysis

This work addresses the challenge of deploying large multimodal models on resource-constrained devices, representing an incremental improvement through extreme quantization.

The paper tackles the problem of compressing vision-language models like CLIP for efficient deployment by proposing TernaryCLIP, which converts weights to ternary format, achieving up to 99% ternarization, 16.98× compression, 2.3× inference acceleration, and maintaining performance on zero-shot tasks across 41 datasets.

Recent years have witnessed an increasing interest in image-text contrastive modeling, exemplified by models such as Contrastive Language-Image Pretraining (CLIP). In this paper, we propose the TernaryCLIP, a lightweight computational framework that converts connection weights of both vision and text encoders of CLIP into the ternary format, instead of full-precision or floating ones. TernaryCLIP incorporates quantization-aware training and distillation modules, preventing precision degradation and enabling low-cost and high-efficiency computations. Comprehensive experiments demonstrate that TernaryCLIP can achieve up to 99\% ternarized weights with 1.58-bit representation, 16.98 $\times$ compression ratio, 2.3 $\times$ inference acceleration, 16 $\times$ storage reduction, 10 $\times$ memory optimization, and 60\% sparsity while maintaining promising performance on zero-shot image classification and image-text retrieval tasks across 41 commonly used datasets. Our work highlights the feasibility of extreme quantization for large multimodal models, supporting effective and efficient deployment on resource-constrained devices. The model and code can be accessed from Hugging Face and GitHub.

View on arXiv PDF

Similar