CV LG Apr 15, 2022

Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer

Hyungyung Lee, Sungjin Park, Joonseok Lee, Edward Choi

arXiv:2204.07537v23.74 citations h-index: 28 Has Code

Originality: Incremental advance

AI Analysis

This work targets multimodal generation, a problem of interest to AI and computer vision researchers. It offers an incremental improvement by combining VQ-VAE with a Transformer encoder and an input masking strategy.

The paper tackles unconditional image-text pair generation by proposing Multimodal Cross-Quantization VAE (MXQ-VAE), a novel vector quantizer for joint image-text representations. Experiments on synthetic and real-world datasets indicate that it generates semantically consistent pairs and outperforms several baselines.

Although deep generative models have gained a lot of attention, most existing works are designed for unimodal generation. In this paper, we explore a new method for unconditional image-text pair generation. We design Multimodal Cross-Quantization VAE (MXQ-VAE), a novel vector quantizer for joint image-text representations, with which we discover that a joint image-text representation space is effective for semantically consistent image-text pair generation. To learn a multimodal semantic correlation in a quantized space, we combine VQ-VAE with a Transformer encoder and apply an input masking strategy. Specifically, MXQ-VAE accepts a masked image-text pair as input and learns a quantized joint representation space, so that the input can be converted to a unified code sequence. We then perform unconditional image-text pair generation with the code sequence. Extensive experiments on synthetic and real-world datasets show the correlation between the quantized joint space and the multimodal generation capability. In addition, we demonstrate the superiority of our approach in both aspects over several baselines. The source code is publicly available at: https://github.com/ttumyche/MXQ-VAE.
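
The pipeline described in the abstract (mask an image-text pair, then convert it to a unified code sequence) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the zero mask vector, the single shared codebook, and the toy feature vectors are all assumptions made for illustration, and the Transformer encoder that MXQ-VAE applies before quantization is omitted entirely. The sketch shows only the two generic steps, input masking and nearest-neighbour vector quantization as in VQ-VAE.

```python
import random


def mask_tokens(tokens, mask_ratio, mask_value, rng):
    """Replace a random subset of token vectors with a mask vector."""
    n_mask = int(len(tokens) * mask_ratio)
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_value if i in masked_idx else t for i, t in enumerate(tokens)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def encode_pair(image_tokens, text_tokens, codebook, mask_ratio=0.25, seed=0):
    """Hypothetical helper: mask a joint image-text sequence and quantize it.

    Assumption: image and text features share one dimension and one codebook.
    """
    rng = random.Random(seed)
    dim = len(codebook[0])
    # Concatenate both modalities into one joint sequence.
    joint = list(image_tokens) + list(text_tokens)
    masked = mask_tokens(joint, mask_ratio, [0.0] * dim, rng)
    # A real model would pass `masked` through a Transformer encoder here;
    # this sketch quantizes the masked features directly.
    return quantize(masked, codebook)


# Toy example: two image tokens and one text token, 2-D features.
codebook = [[0.0, 0.0], [1.0, 1.0]]
image_tokens = [[0.9, 1.1], [0.1, -0.1]]
text_tokens = [[1.0, 0.8]]
codes = encode_pair(image_tokens, text_tokens, codebook, mask_ratio=0.0)
```

The resulting `codes` list is one code index per input token, covering both modalities in a single sequence. A generative model over such sequences would then be trained separately, which this sketch does not attempt.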
