AI

May 29, 2025

Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding

Mingyang Mao, Mariela M. Perez-Cabarcas, Utteja Kallakuri, Nicholas R. Waytowich, Xiaomin Lin, Tinoosh Mohsenin

arXiv:2505.23990v216.517 citations

h-index: 30

Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for efficient adaptive assistance in dynamic, information-rich scenarios for humans interacting with robots, though it appears incremental as it builds on existing retrieval-augmented generation and multimodal methods.

The paper tackles the problem of adaptive video understanding for human assistance by presenting Multi-RAG, a multimodal retrieval-augmented generation system that integrates video, audio, and text to reduce cognitive load. It achieves superior performance on the MMBench-Video dataset compared to existing models while using fewer resources and less input data.

To effectively engage in human society, the ability to adapt, filter information, and make informed decisions in ever-changing situations is critical. As robots and intelligent agents become more integrated into human life, there is a growing opportunity, and need, to offload the cognitive burden on humans to these systems, particularly in dynamic, information-rich scenarios. To fill this critical need, we present Multi-RAG, a multimodal retrieval-augmented generation system designed to provide adaptive assistance to humans in information-intensive circumstances. Our system aims to improve situational understanding and reduce cognitive load by integrating and reasoning over multi-source information streams, including video, audio, and text. As an enabling step toward long-term human-robot partnerships, Multi-RAG explores how multimodal information understanding can serve as a foundation for adaptive robotic assistance in dynamic, human-centered situations. To evaluate its capability in a realistic human-assistance proxy task, we benchmarked Multi-RAG on the MMBench-Video dataset, a challenging multimodal video understanding benchmark. Our system achieves superior performance compared to existing open-source video large language models (Video-LLMs) and large vision-language models (LVLMs), while utilizing fewer resources and less input data. The results demonstrate Multi-RAG's potential as a practical and efficient foundation for future human-robot adaptive assistance systems in dynamic, real-world contexts.

