LLM-I: LLMs are Naturally Interleaved Multimodal Creators
LLM-I addresses the "one-tool" bottleneck in multimodal generation for tasks requiring factual grounding or programmatic precision, though the contribution appears incremental because it builds on existing tool-use and RL methods.
The paper tackles interleaved image-text generation by proposing LLM-I, a framework that treats the task as a tool-use problem in which a central agent orchestrates specialized visual tools. It reports state-of-the-art performance, outperforming existing methods by a large margin across four benchmarks.
We propose LLM-Interleaved (LLM-I), a flexible and dynamic framework that reframes interleaved image-text generation as a tool-use problem. LLM-I is designed to overcome the "one-tool" bottleneck of current unified models, which are limited to synthetic imagery and struggle with tasks requiring factual grounding or programmatic precision. Our framework empowers a central LLM or MLLM agent to intelligently orchestrate a diverse toolkit of specialized visual tools, including online image search, diffusion-based generation, code execution, and image editing. The agent is trained to select and apply these tools proficiently via a Reinforcement Learning (RL) framework that features a hybrid reward system combining rule-based logic with judgments from LLM and MLLM evaluators. Trained on a diverse new dataset using four different model backbones, LLM-I demonstrates state-of-the-art performance, outperforming existing methods by a large margin across four benchmarks. We also introduce a novel test-time scaling strategy that provides further performance gains. Project Page: https://github.com/ByteDance-BandAI/LLM-I.
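To make the tool-use framing and the hybrid reward more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the agent emits tool calls as JSON inside a hypothetical `<tool>...</tool>` tag, replaces the four tool families with stub functions, and combines the rule-based, LLM-judge, and MLLM-judge scores with a weighted sum. The tag format, the names `render`, `rule_reward`, and `hybrid_reward`, the specific rule checked, and the weights are all illustrative assumptions; the abstract does not specify any of them.

```python
import json
import re
from typing import Callable, Dict, Tuple

# Hypothetical tag format: the source does not say how the agent emits tool calls.
TOOL_CALL = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)

# Stub tools standing in for the four tool families named in the abstract.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda q: f"[image from search: {q}]",
    "diffusion": lambda q: f"[generated image: {q}]",
    "code": lambda q: f"[chart rendered by code: {q}]",
    "edit": lambda q: f"[edited image: {q}]",
}


def render(agent_output: str) -> str:
    """Replace each tool call in the agent's text with the stub tool's result."""

    def dispatch(match: re.Match) -> str:
        try:
            call = json.loads(match.group(1))
            return TOOLS[call["tool"]](call["prompt"])
        except (json.JSONDecodeError, KeyError, TypeError):
            return "[invalid tool call]"

    return TOOL_CALL.sub(dispatch, agent_output)


def rule_reward(agent_output: str, expected_images: int) -> float:
    """Assumed rule: 1.0 only if the call count matches and every call is well-formed."""
    calls = TOOL_CALL.findall(agent_output)
    if len(calls) != expected_images:
        return 0.0
    for raw in calls:
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            return 0.0
        if not isinstance(call, dict):
            return 0.0
        if call.get("tool") not in TOOLS or "prompt" not in call:
            return 0.0
    return 1.0


def hybrid_reward(
    r_rule: float,
    r_llm: float,
    r_mllm: float,
    weights: Tuple[float, float, float] = (0.2, 0.4, 0.4),  # illustrative weights
) -> float:
    """Weighted sum of rule-based, LLM-judge, and MLLM-judge scores."""
    w_rule, w_llm, w_mllm = weights
    return w_rule * r_rule + w_llm * r_llm + w_mllm * r_mllm


if __name__ == "__main__":
    output = (
        'Paris at night. <tool>{"tool": "search", "prompt": "Eiffel Tower"}</tool> '
        'Sales: <tool>{"tool": "code", "prompt": "bar chart"}</tool>'
    )
    print(render(output))
    # Judge scores would come from LLM and MLLM evaluators; fixed placeholders here.
    print(hybrid_reward(rule_reward(output, expected_images=2), r_llm=0.5, r_mllm=0.5))
```

In this sketch the agent only writes text; the choice of tool is part of that text, which is what lets a single policy mix retrieved, generated, plotted, and edited images in one response. A real system would replace the stubs with actual search, diffusion, sandboxed code execution, and editing backends, and would obtain `r_llm` and `r_mllm` from judge models.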