CV · May 23, 2024

LG-VQ: Language-Guided Codebook Learning

arXiv:2405.14206v2 · 12 citations · h-index: 35 · NIPS
Originality: Incremental advance
AI Analysis

This work addresses modal gaps in VQ codebooks, a problem relevant to researchers and practitioners in multi-modal AI. It is an incremental advance because it builds on existing VQ models.

The paper tackles modal gaps in vector quantization (VQ) for image synthesis, which lead to suboptimal performance in multi-modal tasks such as text-to-image generation. It proposes LG-VQ, a language-guided codebook learning framework that aligns the codebook with text. The authors report superior performance on reconstruction and multi-modal downstream tasks.

Vector quantization (VQ) is a key technique in high-resolution and high-fidelity image synthesis, which aims to learn a codebook to encode an image with a sequence of discrete codes and then generate an image in an autoregressive manner. Although existing methods have shown superior performance, most of them learn a single-modal codebook (e.g., image), resulting in suboptimal performance when the codebook is applied to multi-modal downstream tasks (e.g., text-to-image, image captioning) due to the existence of modal gaps. In this paper, we propose a novel language-guided codebook learning framework, called LG-VQ, which aims to learn a codebook that can be aligned with the text to improve the performance of multi-modal downstream tasks. Specifically, we first introduce pre-trained text semantics as prior knowledge, then design two novel alignment modules (i.e., Semantic Alignment Module and Relationship Alignment Module) to transfer such prior knowledge into codes for achieving codebook-text alignment. In particular, our LG-VQ method is model-agnostic and can be easily integrated into existing VQ models. Experimental results show that our method achieves superior performance on reconstruction and various multi-modal downstream tasks.
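
To make the codebook idea in the abstract concrete, here is a minimal sketch of the generic VQ lookup step: each feature vector is replaced by the index of its nearest codebook entry. This is not the LG-VQ method itself and includes none of its alignment modules. The function names (`squared_distance`, `quantize`) and the toy 2-D data are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    features: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[Vector]]:
    """Map each feature vector to its nearest codebook entry.

    Returns the discrete code indices and the corresponding codebook vectors.
    """
    indices: List[int] = []
    for feature in features:
        # Pick the index of the codebook entry with the smallest distance.
        nearest = min(
            range(len(codebook)),
            key=lambda k: squared_distance(feature, codebook[k]),
        )
        indices.append(nearest)
    quantized = [codebook[i] for i in indices]
    return indices, quantized


# Hypothetical toy data: a 3-entry codebook and 3 encoder outputs in 2-D.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
features = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.8)]

indices, quantized = quantize(features, codebook)
print(indices)    # discrete codes, one per feature vector
print(quantized)  # the codebook vectors those codes refer to
```

In a real VQ model the features would come from an image encoder and the codebook would be learned during training; the sequence of indices is what a downstream autoregressive model would then generate.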

