CVIT · Sep 7, 2025

Compression Beyond Pixels: Semantic Compression with Multimodal Foundation Models

arXiv:2509.05925v1 · 6 citations · h-index: 9 · MLSP
Originality: Highly original
AI Analysis

The paper addresses the need for semantic preservation over pixel-level reconstruction in emerging applications. It offers a novel approach with zero-shot robustness across diverse data distributions and downstream tasks.

The paper tackles semantic image compression by compressing CLIP feature embeddings instead of pixels. It achieves an average bit rate of approximately (2-3) × 10^(-3) bits per pixel, which is less than 5% of the bitrate mainstream methods need for comparable performance.

Recent deep learning-based methods for lossy image compression achieve competitive rate-distortion performance through extensive end-to-end training and advanced architectures. However, emerging applications increasingly prioritize semantic preservation over pixel-level reconstruction and demand robust performance across diverse data distributions and downstream tasks. These challenges call for advanced semantic compression paradigms. Motivated by the zero-shot and representational capabilities of multimodal foundation models, we propose a novel semantic compression method based on the contrastive language-image pretraining (CLIP) model. Rather than compressing images for reconstruction, we propose compressing the CLIP feature embeddings into minimal bits while preserving semantic information across different tasks. Experiments show that our method maintains semantic integrity across benchmark datasets, achieving an average bit rate of approximately (2-3) × 10^(-3) bits per pixel. This is less than 5% of the bitrate required by mainstream image compression approaches for comparable performance. Remarkably, even under extreme compression, the proposed approach exhibits zero-shot robustness across diverse data distributions and downstream tasks.
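
To give a sense of scale for the reported rate, the following minimal sketch converts bits per pixel into a total bit budget per image, and derives the baseline rate implied by the "less than 5%" comparison. The 512 × 512 image size and the helper names `bit_budget` and `implied_baseline_bpp` are illustrative assumptions, not taken from the paper.

```python
def bit_budget(width: int, height: int, bpp: float) -> float:
    """Total bits used for one image at a given bits-per-pixel rate."""
    return width * height * bpp


def implied_baseline_bpp(bpp: float, fraction: float = 0.05) -> float:
    """Lower bound on the baseline rate if `bpp` is less than `fraction` of it."""
    return bpp / fraction


# Hypothetical image size, chosen only for illustration.
WIDTH, HEIGHT = 512, 512

for bpp in (2e-3, 3e-3):
    bits = bit_budget(WIDTH, HEIGHT, bpp)
    print(f"{bpp:.0e} bpp -> {bits:.1f} bits ({bits / 8:.1f} bytes), "
          f"baseline above {implied_baseline_bpp(bpp):.2f} bpp")
```

Under these assumptions, a 512 × 512 image at 2 × 10^(-3) bits per pixel occupies roughly 524 bits, about 66 bytes. The 5% figure then implies that the mainstream methods being compared operate above roughly 0.04 to 0.06 bits per pixel.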

