CV | AI | Jun 22, 2025

MUPA: Towards Multi-Path Agentic Reasoning for Grounded Video Question Answering

arXiv:2506.18071v2 | 3 citations | h-index: 17 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses unreliable visual evidence in video-language models, a concern for both researchers and practitioners. The contribution is incremental, since it builds on existing agentic reasoning methods.

The paper tackles poorly grounded predictions in Grounded Video Question Answering by proposing MUPA, a multi-path agentic approach that improves grounding fidelity without sacrificing answer accuracy. At 7B parameters it reports state-of-the-art Acc@GQA of 30.3% on NExT-GQA and 47.4% on DeVE-QA.
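The Acc@GQA figures above combine answer correctness with grounding quality. The summary does not define the metric, so the following is only a minimal sketch under an assumption: a question counts as a hit when the answer is correct and the predicted time span has intersection-over-prediction (IoP) of at least 0.5 with the annotated span, which is our understanding of the NExT-GQA convention. The function names, the span format, and the threshold default are illustrative, not taken from the MUPA code.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec); assumed span format


def iop(pred: Span, gt: Span) -> float:
    """Intersection over prediction: overlap length / predicted span length."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    pred_len = pred[1] - pred[0]
    return inter / pred_len if pred_len > 0 else 0.0


def acc_at_gqa(
    answers_correct: List[bool],
    pred_spans: List[Span],
    gt_spans: List[Span],
    threshold: float = 0.5,  # assumed IoP threshold
) -> float:
    """Percentage of questions that are answered correctly AND grounded."""
    if not answers_correct:
        return 0.0
    hits = sum(
        1
        for ok, pred, gt in zip(answers_correct, pred_spans, gt_spans)
        if ok and iop(pred, gt) >= threshold
    )
    return 100.0 * hits / len(answers_correct)
```

A correct answer with a loosely grounded span does not count, which is why this metric is lower than plain QA accuracy.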

Grounded Video Question Answering (Grounded VideoQA) requires aligning textual answers with explicit visual evidence. However, modern multimodal models often rely on linguistic priors and spurious correlations, resulting in poorly grounded predictions. In this work, we propose MUPA, a cooperative MUlti-Path Agentic approach that unifies video grounding, question answering, answer reflection and aggregation to tackle Grounded VideoQA. MUPA features three distinct reasoning paths on the interplay of grounding and QA agents in different chronological orders, along with a dedicated reflection agent to judge and aggregate the multi-path results to accomplish consistent QA and grounding. This design markedly improves grounding fidelity without sacrificing answer accuracy. Despite using only 2B parameters, our method outperforms all 7B-scale competitors. When scaled to 7B parameters, MUPA establishes new state-of-the-art results, with Acc@GQA of 30.3% and 47.4% on NExT-GQA and DeVE-QA respectively, demonstrating MUPA's effectiveness towards trustworthy video-language understanding. Our code is available at https://github.com/longmalongma/MUPA.
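The multi-path design described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the agent interfaces (`Grounder`, `Answerer`, `Judge`), the exact composition of the three paths (the abstract only says they order grounding and QA differently, so the third path in particular is a guess), and the score-sum aggregation rule. The real agents are model calls; see the authors' repository for the actual method.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Span = Tuple[float, float]       # evidence span in seconds
Candidate = Tuple[str, Span]     # (answer, evidence span)

# Hypothetical agent interfaces, not the MUPA API.
Grounder = Callable[[str, Optional[str]], Span]   # (question, answer hint) -> span
Answerer = Callable[[str, Optional[Span]], str]   # (question, span hint) -> answer
Judge = Callable[[str, Candidate], float]         # (question, candidate) -> score


def run_paths(question: str, ground: Grounder, answer: Answerer) -> List[Candidate]:
    """Run three assumed orderings of the grounding and QA agents."""
    # Path 1: ground first, then answer from the grounded span.
    span1 = ground(question, None)
    cand1 = (answer(question, span1), span1)
    # Path 2: answer first, then ground with the answer as a hint.
    ans2 = answer(question, None)
    cand2 = (ans2, ground(question, ans2))
    # Path 3 (assumed): run both agents independently of each other.
    cand3 = (answer(question, None), ground(question, None))
    return [cand1, cand2, cand3]


def reflect(question: str, candidates: List[Candidate], judge: Judge) -> Candidate:
    """Score every candidate, pick the answer with the highest total score,
    and return it with the best-scored span that supports that answer."""
    scored = [(judge(question, cand), cand) for cand in candidates]
    totals: Dict[str, float] = defaultdict(float)
    for score, (ans, _) in scored:
        totals[ans] += score
    best_answer = max(totals, key=lambda a: totals[a])
    supporting = [(s, c) for s, c in scored if c[0] == best_answer]
    return max(supporting, key=lambda sc: sc[0])[1]


# Toy stand-ins so the sketch runs without any model.
def toy_ground(question: str, hint: Optional[str]) -> Span:
    return (4.0, 9.0) if hint else (2.0, 12.0)


def toy_answer(question: str, span: Optional[Span]) -> str:
    return "opens the door" if span else "waves"


def toy_judge(question: str, cand: Candidate) -> float:
    start, end = cand[1]
    return 1.0 / (end - start)  # toy rule: prefer tighter spans


candidates = run_paths("What does the man do?", toy_ground, toy_answer)
final = reflect("What does the man do?", candidates, toy_judge)
```

With these toy agents two of the three paths agree on an answer, so the aggregation returns that answer together with the tightest span supporting it. Summing judge scores per answer is one simple way to reward cross-path agreement; other aggregation rules would fit the same interface.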

