CV · MA · Apr 25, 2025

VideoMultiAgents: A Multi-Agent Framework for Video Question Answering

arXiv:2504.20091v2 · 13 citations · h-index: 15 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses multimodal reasoning in video understanding. It offers a multi-agent framework that improves accuracy on specific benchmarks, though the approach is an incremental advance.

The paper tackles the challenge of capturing temporal and interactive contexts in Video Question Answering. It introduces VideoMultiAgents, a multi-agent framework with specialized agents for vision, scene graph analysis, and text processing, and reports state-of-the-art performance, for example 79.0% on Intent-QA (+6.2% over previous SOTA).

Video Question Answering (VQA) inherently relies on multimodal reasoning, integrating visual, temporal, and linguistic cues to achieve a deeper understanding of video content. However, many existing methods rely on feeding frame-level captions into a single model, making it difficult to adequately capture temporal and interactive contexts. To address this limitation, we introduce VideoMultiAgents, a framework that integrates specialized agents for vision, scene graph analysis, and text processing. It enhances video understanding by leveraging complementary multimodal reasoning from independently operating agents. Our approach is also supplemented with question-guided caption generation, which produces captions that highlight objects, actions, and temporal transitions directly relevant to a given query, thus improving answer accuracy. Experimental results demonstrate that our method achieves state-of-the-art performance on Intent-QA (79.0%, +6.2% over previous SOTA), EgoSchema subset (75.4%, +3.4%), and NExT-QA (79.6%, +0.4%). The source code is available at https://github.com/PanasonicConnect/VideoMultiAgents.
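The abstract's idea of independently operating agents whose answers are combined can be illustrated with a small sketch. This is not the authors' implementation: the `AgentAnswer` type, the `make_keyword_agent` stand-in (simple word overlap in place of real vision, scene graph, and text models), and the majority-vote aggregation in `answer_question` are all assumptions made for illustration. The abstract does not say how the agents' outputs are merged.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class AgentAnswer:
    agent: str       # which agent produced the answer
    choice: int      # index of the selected multiple-choice option
    rationale: str   # short explanation, kept for inspection


# Hypothetical agent signature: (question, options, evidence) -> AgentAnswer
Agent = Callable[[str, List[str], str], AgentAnswer]


def make_keyword_agent(name: str) -> Agent:
    """Stand-in agent: picks the option sharing the most words with its evidence.

    A real agent would call a vision model, a scene graph parser, or an LLM.
    """
    def agent(question: str, options: List[str], evidence: str) -> AgentAnswer:
        evidence_words = set(evidence.lower().split())
        scores = [len(evidence_words & set(opt.lower().split())) for opt in options]
        best = max(range(len(options)), key=scores.__getitem__)
        return AgentAnswer(name, best, f"overlap={scores[best]}")
    return agent


def answer_question(
    question: str,
    options: List[str],
    evidence_by_agent: Dict[str, str],
    agents: Dict[str, Agent],
) -> Tuple[int, List[AgentAnswer]]:
    """Run each agent on its own modality, then combine by majority vote.

    Majority voting is an assumed aggregation rule, chosen only for simplicity.
    """
    answers = [
        agents[name](question, options, evidence_by_agent[name]) for name in agents
    ]
    votes = Counter(a.choice for a in answers)
    winner, _ = votes.most_common(1)[0]
    return winner, answers


# Toy usage with made-up evidence strings for each modality.
agents = {name: make_keyword_agent(name) for name in ("vision", "scene_graph", "text")}
options = ["to pick up the ball", "to open the door", "to sit down"]
evidence = {
    "vision": "a child bends toward a red ball on the grass",
    "scene_graph": "child - reaches_for - ball ; ball - on - grass",
    "text": "the child walks over to pick up the ball",
}
winner, answers = answer_question("Why did the child bend down?", options, evidence, agents)
print(options[winner])
```

Each agent sees only its own evidence, so a mistake in one modality can be outvoted by the others. The question-guided captioning described in the abstract would correspond to producing the `text` evidence with the question in hand, which this sketch does not model.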

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
