IV CV

May 22, 2025

Generative Latent Coding for Ultra-Low Bitrate Image and Video Compression

Linfeng Qi, Zhaoyang Jia, Jiahao Li, Bin Li, Houqiang Li, Yan Lu

arXiv:2505.16177v128.224 citations

h-index: 15

IEEE Transactions on Circuits and Systems for Video Technology (Print)

Originality: Highly original

AI Analysis

This work addresses efficient image and video compression at ultra-low bitrates, which matters for applications such as streaming and storage. It builds on existing generative models and adds specific improvements: categorical hyper modules (spatial for images, spatio-temporal for video) and a code-prediction-based loss.

The paper targets high-realism, high-fidelity image and video compression at ultra-low bitrates. It proposes Generative Latent Coding (GLC) models, which perform transform coding in a generative latent space that aligns better with human perception than pixel space. For image compression, GLC-image operates below 0.04 bpp and matches the FID of the previous SOTA model while using 45% fewer bits. For video compression, GLC-video saves 65.3% bitrate over PLVC in terms of DISTS.
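To give a sense of scale for the bitrate figures above, the following minimal sketch converts bits per pixel (bpp) into a file size. The image resolution is a hypothetical example, not a value from the paper, and 0.04 bpp is used as an upper bound since the source only says "less than 0.04 bpp". The baseline calculation assumes the 45% saving applies at exactly this operating point, which the source does not state.

```python
def compressed_size_bytes(bpp: float, width: int, height: int) -> float:
    """Size in bytes of an image coded at `bpp` bits per pixel."""
    return bpp * width * height / 8


def baseline_bpp(our_bpp: float, saving: float) -> float:
    """Bitrate of a baseline, given that ours is `saving` (a fraction) lower."""
    return our_bpp / (1.0 - saving)


# Hypothetical resolution, chosen only for illustration.
WIDTH, HEIGHT = 2048, 1365

size = compressed_size_bytes(0.04, WIDTH, HEIGHT)  # upper bound at 0.04 bpp
raw_size = WIDTH * HEIGHT * 3                      # uncompressed 24-bit RGB
reference = baseline_bpp(0.04, 0.45)               # illustrative only

print(f"compressed: at most {size / 1024:.1f} KiB")
print(f"ratio vs raw RGB: at least {raw_size / size:.0f}x")
print(f"implied baseline bitrate: {reference:.4f} bpp")
```

Under these assumptions the compressed image is a few hundredths of a bit per pixel, which is why pixel-exact reconstruction is not the goal at this operating point and perceptual metrics such as FID and DISTS are reported instead.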

Most existing approaches for image and video compression perform transform coding in the pixel space to reduce redundancy. However, due to the misalignment between pixel-space distortion and human perception, such schemes often have difficulty achieving both high realism and high fidelity at ultra-low bitrates. To solve this problem, we propose Generative Latent Coding (GLC) models for image and video compression, termed GLC-image and GLC-video. The transform coding of GLC is conducted in the latent space of a generative vector-quantized variational auto-encoder (VQ-VAE). Compared to the pixel space, such a latent space offers greater sparsity, richer semantics, and better alignment with human perception, and it shows advantages in achieving high-realism and high-fidelity compression. To further enhance performance, we improve the hyper prior by introducing a spatial categorical hyper module in GLC-image and a spatio-temporal categorical hyper module in GLC-video. Additionally, a code-prediction-based loss function is proposed to enhance semantic consistency. Experiments demonstrate that our scheme achieves high visual quality at ultra-low bitrates for both image and video compression. For image compression, GLC-image operates at a bitrate of less than 0.04 bpp, achieving the same FID as the previous SOTA model MS-ILLM while using 45% fewer bits on the CLIC 2020 test set. For video compression, GLC-video achieves a 65.3% bitrate saving over PLVC in terms of DISTS.
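The vector-quantization step that defines a VQ-VAE latent space can be illustrated with a minimal nearest-codebook lookup. This is a generic sketch, not the authors' implementation: the codebook, the latent vectors, and all function names are hypothetical, and it does not reproduce the paper's transform coding, categorical hyper modules, or code-prediction loss.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def dequantize(indices: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Replace each index with its codebook vector (decoder-side lookup)."""
    return [list(codebook[i]) for i in indices]


# Toy, hand-picked values; a real codebook is learned and far larger.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

indices = quantize(latents, codebook)
reconstructed = dequantize(indices, codebook)

# Cost of sending the indices with a fixed-length code, with no entropy model.
fixed_length_bits = len(indices) * math.log2(len(codebook))

print(indices, reconstructed, fixed_length_bits)
```

The fixed-length cost of log2(K) bits per index for a codebook of size K is only a naive reference point. Learned codecs generally spend fewer bits by modeling the probability of each symbol, which is the role a hyper prior plays.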
