LG | Dec 13, 2024

Efficient Generative Modeling with Residual Vector Quantization-Based Tokens

arXiv:2412.10208v3 | 5 citations | h-index: 5 | ICML
Originality: Incremental advance
AI Analysis

This paper addresses the trade-off between efficiency and fidelity in generative modeling for applications such as image and speech synthesis. It represents an incremental improvement over existing methods.

The paper targets high-fidelity generation with fast sampling by introducing ResGen, which uses Residual Vector Quantization (RVQ) to improve data fidelity without increasing the number of inference steps. It reports better results than autoregressive models on conditional image generation (ImageNet 256x256) and zero-shot text-to-speech synthesis, with either higher fidelity or faster sampling as RVQ depth scales.

We introduce ResGen, an efficient Residual Vector Quantization (RVQ)-based generative model for high-fidelity generation with fast sampling. RVQ improves data fidelity by increasing the number of quantization steps, referred to as depth, but deeper quantization typically increases inference steps in generative models. To address this, ResGen directly predicts the vector embedding of collective tokens rather than individual ones, ensuring that inference steps remain independent of RVQ depth. Additionally, we formulate token masking and multi-token prediction within a probabilistic framework using discrete diffusion and variational inference. We validate the efficacy and generalizability of the proposed method on two challenging tasks across different modalities: conditional image generation on ImageNet 256x256 and zero-shot text-to-speech synthesis. Experimental results demonstrate that ResGen outperforms autoregressive counterparts in both tasks, delivering superior performance without compromising sampling speed. Furthermore, as we scale the depth of RVQ, our generative models exhibit enhanced generation fidelity or faster sampling speeds compared to similarly sized baseline models.
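The relationship between RVQ depth and fidelity described in the abstract can be illustrated with a minimal NumPy sketch. This is generic greedy residual quantization, not ResGen's trained tokenizer or its generative model. The function names, the random codebooks, the per-depth scaling, and the zero code placed in each codebook are all assumptions made for illustration. The summed vector returned by `rvq_decode` is one plausible reading of a single embedding that represents all depth-wise tokens at a position.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy RVQ: at each depth, pick the code nearest to the current residual."""
    residual = x.copy()
    codes = []
    for cb in codebooks:  # cb has shape (K, D)
        # Squared distance from every residual (N, D) to every code (K, D).
        dist = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)  # (N, K)
        idx = dist.argmin(axis=1)
        codes.append(idx)
        residual = residual - cb[idx]
    return np.stack(codes, axis=1)  # (N, depth)

def rvq_decode(codes, codebooks):
    """Sum the selected code vectors over depth: one embedding per position."""
    return sum(cb[codes[:, i]] for i, cb in enumerate(codebooks))

def make_codebooks(depth, num_codes, dim, seed=0):
    """Hypothetical random codebooks (a real tokenizer would learn these)."""
    rng = np.random.default_rng(seed)
    codebooks = []
    for level in range(depth):
        cb = rng.normal(size=(num_codes, dim)) * 0.5 ** level
        cb[0] = 0.0  # assumption: a zero code, so extra depth never raises the error
        codebooks.append(cb)
    return codebooks

rng = np.random.default_rng(1)
x = rng.normal(size=(16, 8))  # 16 positions, 8-dimensional latents
codebooks = make_codebooks(depth=4, num_codes=32, dim=8)
codes = rvq_encode(x, codebooks)

# Reconstruction error when only the first d depths are used.
errors = [
    float(((x - rvq_decode(codes[:, :d], codebooks[:d])) ** 2).mean())
    for d in range(1, 5)
]
print(errors)
```

Each position is described by `depth` token indices, yet decoding collapses them into one vector per position. A model that predicts that vector directly would need a number of prediction steps that does not depend on depth, which is the property the abstract claims for ResGen.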

