Quantize-then-Rectify: Efficient VQ-VAE Training
This work makes training visual tokenizers faster and cheaper for researchers and practitioners in multimodal AI. The contribution is incremental, since it builds on existing VAE methods.
The paper tackles the high computational cost of training VQ-VAEs for visual tokenization. It introduces Quantize-then-Rectify (ReVQ), which transforms a pre-trained VAE into a VQ-VAE and cuts training cost by over two orders of magnitude, to about 22 hours on a single GPU, while maintaining competitive reconstruction quality (rFID = 1.06).
Visual tokenizers are pivotal in multimodal large models, acting as bridges between continuous inputs and discrete tokens. Nevertheless, training high-compression-rate VQ-VAEs remains computationally demanding, often necessitating thousands of GPU hours. This work demonstrates that a pre-trained VAE can be efficiently transformed into a VQ-VAE by controlling quantization noise within the VAE's tolerance threshold. We present \textbf{Quantize-then-Rectify (ReVQ)}, a framework leveraging pre-trained VAEs to enable rapid VQ-VAE training with minimal computational overhead. By integrating \textbf{channel multi-group quantization} to enlarge codebook capacity and a \textbf{post rectifier} to mitigate quantization errors, ReVQ compresses ImageNet images into at most 512 tokens while sustaining competitive reconstruction quality (rFID = 1.06). Significantly, ReVQ reduces training costs by over two orders of magnitude relative to state-of-the-art approaches: ReVQ finishes full training on a single NVIDIA 4090 in approximately 22 hours, whereas comparable methods require 4.5 days on 32 A100 GPUs. Experimental results show that ReVQ achieves superior efficiency-reconstruction trade-offs.
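As a rough illustration of how channel multi-group quantization can enlarge codebook capacity, the NumPy sketch below splits each latent vector along the channel axis into G groups and snaps every group to the nearest entry of its own codebook. The abstract does not describe the mechanism at this level of detail, so the function name `multi_group_quantize`, the `(N, C)` latent layout, the contiguous channel split, and the nearest-neighbour assignment are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def multi_group_quantize(z, codebooks):
    """Quantize latents group by group along the channel axis (illustrative).

    z:         (N, C) array of continuous latents (hypothetical layout).
    codebooks: (G, K, C // G) array, one K-entry codebook per channel group.
    Returns (z_q, indices) with shapes (N, C) and (N, G).
    """
    n, c = z.shape
    g, k, d = codebooks.shape
    assert c == g * d, "channels must split evenly into groups"

    # Split the channel axis into G contiguous groups of width d.
    groups = z.reshape(n, g, d)

    # Squared Euclidean distance from every group to every entry of
    # that group's codebook: result has shape (N, G, K).
    diff = groups[:, :, None, :] - codebooks[None, :, :, :]
    dist = (diff ** 2).sum(axis=-1)

    # Nearest codebook entry per group.
    indices = dist.argmin(axis=-1)                    # (N, G)
    z_q = codebooks[np.arange(g)[None, :], indices]   # (N, G, d)
    return z_q.reshape(n, c), indices


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codebooks = rng.normal(size=(4, 16, 2))   # G=4 groups, K=16 entries each
    z = rng.normal(size=(5, 8))               # 5 latents with C=8 channels
    z_q, indices = multi_group_quantize(z, codebooks)
    print(z_q.shape, indices.shape)
    print("mean squared quantization error:", ((z - z_q) ** 2).mean())
```

Under these assumptions, G groups of K entries each let a single latent vector take up to K^G distinct quantized values while storing only G·K codebook vectors. The post rectifier is not shown here; in this reading it would be a learned module applied to `z_q` before the frozen VAE decoder to reduce the residual quantization error.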