TVIR: Building Deep Research Agents Towards Text-Visual Interleaved Report Generation
For researchers building multimodal deep research agents, this work provides a benchmark and a strong baseline for generating reports whose visual elements are factually reliable and aligned with the surrounding text.
TVIR introduces a benchmark and an agent framework for text-visual interleaved report generation, addressing the limited evaluation of visual reliability and alignment in existing work. In experiments across nine deep research systems, the proposed TVIR-Agent achieves strong overall performance on TVIR-Bench.
Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation. Existing benchmarks and systems, however, remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text-Visual Interleaved Report Generation), which comprises two components. TVIR-Bench is a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals. TVIR-Agent is a hierarchical multi-agent framework, intended as a strong baseline, that constructs outlines, retrieves images, generates charts with traceable sources, and composes reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.
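To make the idea of context-aware sequential writing concrete, the following is a minimal sketch and not the TVIR-Agent implementation. It assumes a report is written section by section in outline order, that each writer call receives the text produced so far, and that each visual carries a source reference. The names `Visual`, `Section`, `compose_report`, and `stub_writer` are hypothetical, and the stub stands in for what would be a model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Visual:
    """A retrieved image or generated chart plus the source it traces back to (hypothetical type)."""
    kind: str      # "image" or "chart"
    caption: str
    source: str    # URL or citation string


@dataclass
class Section:
    """One outline entry: a heading and the analytical sub-goal its visual serves (hypothetical type)."""
    heading: str
    sub_goal: str
    visual: Optional[Visual] = None


def compose_report(
    outline: List[Section],
    write_section: Callable[[Section, str], str],
) -> str:
    """Write sections in outline order, passing all earlier text as context."""
    parts: List[str] = []
    for section in outline:
        context = "\n\n".join(parts)  # everything written so far
        body = write_section(section, context)
        block = f"## {section.heading}\n\n{body}"
        if section.visual is not None:
            v = section.visual
            # Place the visual next to the text it supports and keep its source.
            block += f"\n\n[{v.kind}: {v.caption}] (source: {v.source})"
        parts.append(block)
    return "\n\n".join(parts)


def stub_writer(section: Section, context: str) -> str:
    """Stand-in for a model call; reports how much prior context it saw."""
    return f"{section.sub_goal} (context: {len(context)} chars)"


outline = [
    Section(
        "Market size",
        "Quantify growth",
        Visual("chart", "Revenue by year", "https://example.com/data"),
    ),
    Section("Outlook", "Summarize risks"),
]
report = compose_report(outline, stub_writer)
print(report)
```

In this sketch the second section sees the first section's text and visual placeholder as context, which is one simple way to keep later prose consistent with earlier figures. How TVIR-Agent actually passes context between its agents is not specified in the abstract.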