MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains
MC-Search addresses the need of researchers and developers for better evaluation of multimodal agentic search, though the contribution is incremental in that it builds on existing MM-RAG paradigms.
The authors address the lack of benchmarks for agentic multimodal retrieval-augmented generation (MM-RAG) with long reasoning chains by introducing MC-Search, a benchmark of 3,333 examples averaging 3.7 hops. Benchmarking six leading MLLMs on it reveals systematic issues, and the authors show that their data also improves planning and retrieval fidelity.
With the increasing demand for step-wise, cross-modal, and knowledge-grounded reasoning, multimodal large language models (MLLMs) are evolving beyond the traditional fixed retrieve-then-generate paradigm toward more sophisticated agentic multimodal retrieval-augmented generation (MM-RAG). Existing benchmarks, however, mainly focus on simplified QA with short retrieval chains, leaving adaptive planning and multimodal reasoning underexplored. We present MC-Search, the first benchmark for agentic MM-RAG with long, step-wise annotated reasoning chains spanning five representative reasoning structures. Each example specifies sub-questions, retrieval modalities, supporting facts, and intermediate answers, with fidelity ensured by HAVE (Hop-wise Attribution and Verification of Evidence), resulting in 3,333 high-quality examples averaging 3.7 hops. Beyond answer accuracy, MC-Search introduces new process-level metrics for reasoning quality, stepwise retrieval, and planning accuracy. By developing a unified agentic MM-RAG pipeline, we benchmark six leading MLLMs and reveal systematic issues such as over- and under-retrieval and modality-misaligned planning. Finally, we introduce Search-Align, a process-supervised fine-tuning framework leveraging verified reasoning chains, showing that our data not only enables faithful evaluation but also improves planning and retrieval fidelity in open-source MLLMs.
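To make the annotation format concrete, the following is a minimal sketch of how one step-wise annotated example and two process-level scores might be represented. It is an illustration under stated assumptions, not the benchmark's actual schema or metric definitions: the class names `Hop` and `ReasoningChain`, the field names, the modality labels, and both scoring functions are hypothetical, and the scores are plain set precision/recall and position-wise matching rather than the metrics defined by the authors.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Hop:
    """One annotated reasoning step (field names are hypothetical)."""
    sub_question: str
    modality: str                # e.g. "text" or "image"; label set is an assumption
    supporting_facts: list[str]  # IDs of the evidence items that support this hop
    intermediate_answer: str


@dataclass
class ReasoningChain:
    """A full example: the question, its ordered hops, and the final answer."""
    question: str
    hops: list[Hop] = field(default_factory=list)
    final_answer: str = ""


def hop_retrieval_scores(gold: Hop, retrieved: list[str]) -> tuple[float, float]:
    """Return (precision, recall) of retrieved evidence IDs against one gold hop."""
    gold_set, got = set(gold.supporting_facts), set(retrieved)
    hits = len(gold_set & got)
    precision = hits / len(got) if got else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    return precision, recall


def modality_plan_accuracy(chain: ReasoningChain, predicted: list[str]) -> float:
    """Fraction of gold hops whose predicted modality matches, position by position."""
    if not chain.hops:
        return 0.0
    # zip stops at the shorter sequence, so missing predictions count as misses.
    matches = sum(h.modality == p for h, p in zip(chain.hops, predicted))
    return matches / len(chain.hops)


# Toy example with invented content, used only to exercise the functions.
chain = ReasoningChain(
    question="Which river flows past the building shown in the photo?",
    hops=[
        Hop("Which building is shown?", "image", ["img_1"], "Example Tower"),
        Hop("Which river flows past it?", "text", ["doc_7", "doc_9"], "Example River"),
    ],
    final_answer="Example River",
)
precision, recall = hop_retrieval_scores(chain.hops[1], ["doc_7", "doc_2", "doc_3"])
plan_acc = modality_plan_accuracy(chain, ["image", "image"])
```

In this sketch, retrieving extra irrelevant items lowers precision (over-retrieval), missing gold evidence lowers recall (under-retrieval), and a plan that queries the wrong modality at a hop lowers the modality score, which loosely mirrors the failure modes named in the abstract.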