CL · CV · LG · Dec 11, 2024

Multimodal Latent Language Modeling with Next-Token Diffusion

Tsinghua
arXiv:2412.08635v1 · 45 citations · h-index: 41
Originality: Incremental advance
AI Analysis

This work targets scalable, efficient multimodal modeling and offers a general-purpose interface that could advance large multimodal models. The contribution appears incremental, since it combines existing techniques such as VAEs and diffusion.

The paper unifies generative modeling of discrete and continuous data by proposing LatentLM, which represents continuous data as latent vectors using a VAE and generates those vectors autoregressively with next-token diffusion. It reports results across modalities: surpassing Diffusion Transformers in image generation, outperforming VALL-E 2 in text-to-speech speaker similarity and robustness with 10x fewer decoding steps, and comparing favorably with Transfusion and vector quantized models as training tokens scale up.

Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). In this work, we propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers. Specifically, we employ a variational autoencoder (VAE) to represent continuous data as latent vectors and introduce next-token diffusion for autoregressive generation of these vectors. Additionally, we develop $\sigma$-VAE to address the challenges of variance collapse, which is crucial for autoregressive modeling. Extensive experiments demonstrate the effectiveness of LatentLM across various modalities. In image generation, LatentLM surpasses Diffusion Transformers in both performance and scalability. When integrated into multimodal large language models, LatentLM provides a general-purpose interface that unifies multimodal generation and understanding. Experimental results show that LatentLM achieves favorable performance compared to Transfusion and vector quantized models in the setting of scaling up training tokens. In text-to-speech synthesis, LatentLM outperforms the state-of-the-art VALL-E 2 model in speaker similarity and robustness, while requiring 10x fewer decoding steps. The results establish LatentLM as a highly effective and scalable approach to advance large multimodal models.
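The generation loop described in the abstract can be pictured with a small toy sketch. This is not the authors' implementation. It only illustrates the data flow under stated assumptions: a causal model yields one conditioning state per position, and a diffusion-style head iteratively denoises Gaussian noise into the next latent vector given that state. The functions `toy_causal_state` and `toy_denoiser`, the linear `alpha_bar` schedule, and the deterministic DDIM-style update are all hypothetical stand-ins chosen for brevity. The VAE encoder and decoder are omitted.

```python
import math
import random


def alpha_bar(t, steps):
    """Toy noise schedule (an assumption): 1.0 at t=0, shrinking as t grows."""
    return 1.0 - t / (steps + 1)


def toy_causal_state(prefix, dim):
    """Hypothetical stand-in for the causal Transformer: prefix -> conditioning vector."""
    if not prefix:
        return [1.0] * dim  # plays the role of a start-of-sequence state
    return [0.5 * v for v in prefix[-1]]


def toy_denoiser(x_t, t, h):
    """Hypothetical stand-in for the diffusion head: predicts the clean latent.

    A trained head would use x_t and t; this stub simply returns h.
    """
    return list(h)


def sample_next_latent(h, steps, rng):
    """Denoise from Gaussian noise to one latent vector, conditioned on h."""
    x = [rng.gauss(0.0, 1.0) for _ in h]
    for t in range(steps, 0, -1):
        ab_t, ab_prev = alpha_bar(t, steps), alpha_bar(t - 1, steps)
        x0_hat = toy_denoiser(x, t, h)
        # Noise implied by the current sample and the predicted clean latent.
        eps_hat = [(xi - math.sqrt(ab_t) * x0i) / math.sqrt(1.0 - ab_t)
                   for xi, x0i in zip(x, x0_hat)]
        # Deterministic DDIM-style step to the previous noise level.
        x = [math.sqrt(ab_prev) * x0i + math.sqrt(1.0 - ab_prev) * ei
             for x0i, ei in zip(x0_hat, eps_hat)]
    return x


def generate_latents(n_tokens, dim=4, steps=8, seed=0):
    """Autoregressive loop: each new latent is conditioned on the latents so far."""
    rng = random.Random(seed)
    seq = []
    for _ in range(n_tokens):
        h = toy_causal_state(seq, dim)
        seq.append(sample_next_latent(h, steps, rng))
    return seq


if __name__ == "__main__":
    for z in generate_latents(3):
        print(z)
```

Because the stub denoiser ignores its noisy input, each sampled latent collapses to its conditioning vector, which makes the loop easy to check by hand. In a real system the head would be a trained network, and the resulting latents would be passed to the VAE decoder to reconstruct an image or audio signal.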

Code Implementations: 1 repo
