AI | Jan 14

M$^3$Searcher: Modular Multimodal Information Seeking Agency with Retrieval-Oriented Reasoning

Xiaohan Yu, Chao Feng, Lang Mei, Chong Chen

arXiv:2601.09278v1 | 7.52 citations | h-index: 4

Originality: Incremental advance

AI Analysis

This work addresses the challenge of multimodal information seeking for AI agents, though the advance is incremental because it builds on existing text-based agents.

The paper extends autonomous information-seeking agents to multimodal settings by proposing M$^3$Searcher, a modular agent that decouples information acquisition from answer derivation. According to the authors, it outperforms existing approaches and shows strong transfer adaptability and effective reasoning on complex multimodal tasks.

Recent advances in DeepResearch-style agents have demonstrated strong capabilities in autonomous information acquisition and synthesis from real-world web environments. However, existing approaches remain fundamentally limited to the text modality. Extending autonomous information-seeking agents to multimodal settings introduces two critical challenges: the specialization-generalization trade-off that emerges when training models for multimodal tool use at scale, and the severe scarcity of training data capturing complex, multi-step multimodal search trajectories. To address these challenges, we propose M$^3$Searcher, a modular multimodal information-seeking agent that explicitly decouples information acquisition from answer derivation. M$^3$Searcher is optimized with a retrieval-oriented multi-objective reward that jointly encourages factual accuracy, reasoning soundness, and retrieval fidelity. In addition, we develop MMSearchVQA, a multimodal multi-hop dataset to support retrieval-centric RL training. Experimental results demonstrate that M$^3$Searcher outperforms existing approaches and exhibits strong transfer adaptability and effective reasoning in complex multimodal tasks.
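As a rough illustration of what a multi-objective reward of this kind could look like, the sketch below combines three component scores (factual accuracy, reasoning soundness, retrieval fidelity) into one scalar by a weighted sum. The abstract does not say how the paper combines these terms, so the weighted-sum form, the `RewardWeights` class, the `combined_reward` function, the weight values, and the [0, 1] score range are all assumptions made for illustration, not the authors' formulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the abstract gives no values.
    accuracy: float = 0.5
    reasoning: float = 0.25
    retrieval: float = 0.25


def combined_reward(
    accuracy: float,
    reasoning: float,
    retrieval: float,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of three component scores, each assumed to lie in [0, 1]."""
    scores = (
        ("accuracy", accuracy),
        ("reasoning", reasoning),
        ("retrieval", retrieval),
    )
    for name, value in scores:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")
    return (
        weights.accuracy * accuracy
        + weights.reasoning * reasoning
        + weights.retrieval * retrieval
    )


if __name__ == "__main__":
    # A trajectory with a correct answer but weak retrieval evidence.
    print(combined_reward(accuracy=1.0, reasoning=0.5, retrieval=0.0))
```

In an actual RL setup the three scores would come from separate judges or metrics; here they are passed in directly so the combination step stays visible.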
