CL, AI, CV, LG | Feb 19, 2024

AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

arXiv:2402.12226v5 | 249 citations | h-index: 66 | ACL
Originality: Incremental advance
AI Analysis

This work addresses the challenge of integrating diverse data types into AI systems for broader multimodal applications. It is an incremental advance because it builds on existing LLM architectures.

The paper tackles the problem of unifying multiple modalities like speech, text, images, and music in a single language model using discrete representations, achieving performance comparable to specialized models across all modalities.

We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, facilitating the seamless integration of new modalities into LLMs, akin to the incorporation of new languages. We build a multimodal text-centric dataset for multimodal alignment pre-training. Utilizing generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, thus equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, proving that discrete representations can effectively and conveniently unify multiple modalities within a language model. Demos are shown in https://junzhan2000.github.io/AnyGPT.github.io/
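The abstract's central idea, that new modalities can be added through data-level preprocessing alone, much like adding a new language, can be illustrated with a small sketch. This is not the authors' implementation. It assumes each non-text modality has already been converted to integer codes by some tokenizer, and it shows one plausible way to map those codes into an extended vocabulary and flatten a mixed sample into a single sequence. The vocabulary size, codebook sizes, boundary-token scheme, and function names are all hypothetical.

```python
# Hypothetical sizes, chosen only for illustration (not taken from the paper).
TEXT_VOCAB_SIZE = 32000
CODEBOOK_SIZES = {"image": 8192, "speech": 1024, "music": 4096}


def build_vocab_layout(text_vocab_size, codebook_sizes):
    """Give each modality a contiguous ID range after the text vocabulary.

    Each modality receives a start token, an end token, and then one ID
    per codebook entry. Returns the layout and the new total vocab size.
    """
    layout = {}
    next_id = text_vocab_size
    for name, size in codebook_sizes.items():
        layout[name] = {
            "start": next_id,       # assumed boundary token: segment begins
            "end": next_id + 1,     # assumed boundary token: segment ends
            "offset": next_id + 2,  # first ID used for this modality's codes
            "size": size,
        }
        next_id += size + 2
    return layout, next_id


def encode_segment(layout, modality, codes):
    """Shift raw codebook indices into the shared vocabulary and wrap them."""
    info = layout[modality]
    for code in codes:
        if not 0 <= code < info["size"]:
            raise ValueError(f"code {code} out of range for {modality}")
    shifted = [info["offset"] + code for code in codes]
    return [info["start"]] + shifted + [info["end"]]


def flatten_sample(layout, segments):
    """Concatenate (modality, ids) segments into one token sequence.

    Text IDs pass through unchanged; other modalities are shifted and wrapped.
    """
    sequence = []
    for modality, ids in segments:
        if modality == "text":
            sequence.extend(ids)
        else:
            sequence.extend(encode_segment(layout, modality, ids))
    return sequence


layout, total_vocab = build_vocab_layout(TEXT_VOCAB_SIZE, CODEBOOK_SIZES)
sample = flatten_sample(layout, [("text", [5, 6]), ("speech", [0, 3])])
```

Under these assumptions the language model itself needs no architectural change beyond a larger embedding table: it is trained with ordinary next-token prediction over the flattened sequence, and generated IDs that fall in a modality's range would be handed back to that modality's decoder.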

Code Implementations: 1 repo
