AI · CL · Mar 30

MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

arXiv:2603.28407 · 97.82 citations · h-index: 15
AI Analysis

This work addresses the need for better evaluation benchmarks for deep research agents, particularly for multimodal tasks and for the research process itself. The contribution is incremental, since it builds on existing evaluation frameworks.

The authors introduce MiroEval, a benchmark of 100 tasks (70 text-only, 30 multimodal) that assesses deep research systems along three dimensions: adaptive synthesis quality, agentic factuality verification, and process-centric evaluation. They find that multimodal tasks are more challenging, with most systems declining by 3 to 10 points, and that MiroThinker-H1 achieves the highest overall performance.

Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves. To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving setting. The proposed evaluation suite assesses deep research systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics, agentic factuality verification via active retrieval and reasoning over both web sources and multimodal attachments, and process-centric evaluation that audits how the system searches, reasons, and refines throughout its investigation. Evaluation across 13 systems yields three principal findings: the three evaluation dimensions capture complementary aspects of system capability, with each revealing distinct strengths and weaknesses across systems; process quality serves as a reliable predictor of overall outcome while revealing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems declining by 3 to 10 points. The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking the highest overall in both settings. Human verification and robustness results confirm the reliability of the benchmark and evaluation framework. MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.
