CL · AI · CV · Jul 8, 2024

ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation

arXiv:2407.06135v1 · 99 citations · h-index: 11 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses inefficient, fragmented multimodal generation in open-source models, a problem for both researchers and developers. It is an incremental advance because it builds on an existing model, Chameleon.

The paper tackles the limitations of previous open-source large multimodal models by introducing Anole, an autoregressive, native model for interleaved image-text generation. Anole produces high-quality, coherent multimodal outputs without relying on adapters or separate diffusion models.

Previous open-source large multimodal models (LMMs) have faced several limitations: (1) they often lack native integration, requiring adapters to align visual representations with pre-trained large language models (LLMs); (2) many are restricted to single-modal generation; (3) while some support multimodal generation, they rely on separate diffusion models for visual modeling and generation. To mitigate these limitations, we present Anole, an open, autoregressive, native large multimodal model for interleaved image-text generation. We build Anole from Meta AI's Chameleon, adopting an innovative fine-tuning strategy that is both data-efficient and parameter-efficient. Anole demonstrates high-quality, coherent multimodal generation capabilities. We have open-sourced our model, training framework, and instruction tuning data.

Code Implementations: 1 repo