CVJul 10, 2025

MGVQ: Could VQ-VAE Beat VAE? A Generalizable Tokenizer with Multi-group Quantization

arXiv:2507.07997v26 citationsh-index: 9Has Code
Originality Incremental advance
AI Analysis

This work addresses the problem of improving reconstruction fidelity in VQ-VAEs for HD image processing, representing an incremental advance in quantization strategies.

The paper tackles the performance gap between VQ-VAEs and VAEs in image reconstruction by proposing MGVQ, a method using multi-group quantization to enhance codebook representation, achieving state-of-the-art results with rFID 0.49 vs. 0.91 on ImageNet and superior PSNR on zero-shot benchmarks.

Vector Quantized Variational Autoencoders (VQ-VAEs) are fundamental models that compress continuous visual data into discrete tokens. Existing methods have tried to improve the quantization strategy for better reconstruction quality, however, there still exists a large gap between VQ-VAEs and VAEs. To narrow this gap, we propose MGVQ, a novel method to augment the representation capability of discrete codebooks, facilitating easier optimization for codebooks and minimizing information loss, thereby enhancing reconstruction quality. Specifically, we propose to retain the latent dimension to preserve encoded features and incorporate a set of sub-codebooks for quantization. Furthermore, we construct comprehensive zero-shot benchmarks featuring resolutions of 512p and 2k to evaluate the reconstruction performance of existing methods rigorously. MGVQ achieves the state-of-the-art performance on both ImageNet and 8 zero-shot benchmarks across all VQ-VAEs. Notably, compared with SD-VAE, we outperform them on ImageNet significantly, with rFID 0.49 v.s. 0.91, and achieve superior PSNR on all zero-shot benchmarks. These results highlight the superiority of MGVQ in reconstruction and pave the way for preserving fidelity in HD image processing tasks. Code will be publicly available at https://github.com/MKJia/MGVQ.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes