CL | May 16, 2024

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Meta AI, MIT
arXiv:2405.09818v2 | 854 citations | h-index: 31
Originality: Highly original
AI Analysis

This work addresses the need for more integrated AI systems that can handle arbitrary sequences of images and text. It represents a significant step forward in multimodal AI, though it builds on existing early-fusion and token-based approaches.

The paper tackles unified modeling of multimodal documents by introducing Chameleon, a family of early-fusion, token-based models that understand and generate images and text in any sequence. Chameleon achieves state-of-the-art performance in image captioning, outperforms Llama-2 on text-only tasks, and, according to human judgments, matches or exceeds much larger models such as GPT-4V in long-form mixed-modal generation.

We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in a unified modeling of full multimodal documents.

Code Implementations: 3 repos
