AI · May 27

Agentic Active Omni-Modal Perception for Multi-Hop Audio-Visual Reasoning

arXiv:2605.28192 · 86.8 · Has Code
AI Analysis

For researchers in multimodal AI, this work addresses the challenge of multi-hop reasoning over temporally dispersed audio-visual evidence, which current Omni-LLMs struggle with.

The paper introduces MOV-Bench, a benchmark for multi-hop audio-visual reasoning, and proposes AOP-Agent, an agentic framework that improves Omni-LLMs' performance on this task, achieving notable gains on long videos and reasoning-intensive questions.

Multi-hop audio-visual reasoning remains challenging for Omni-LLMs, as relevant evidence is often sparse, temporally dispersed, and distributed across both audio and visual streams. Existing benchmarks provide limited investigation of this setting, typically involving only a limited number of modalities, relevant temporal segments, or reasoning steps. In this work, we introduce MOV-Bench, a benchmark containing 519 carefully curated questions that require multi-hop reasoning over temporally dispersed audio-visual evidence. Evaluations on MOV-Bench reveal that current Omni-LLMs still struggle with multi-hop cross-modal reasoning. To address this challenge, we further propose AOP-Agent, an efficient agentic framework built on open-source Omni-LLMs for active omni-modal perception. By combining a hierarchical omni-modal memory with a collaborative observe-reflect-replan loop, AOP-Agent enables open-source Omni-LLMs to perform active perception without additional training or proprietary models. Experiments on MOV-Bench and OmniVideoBench demonstrate that AOP-Agent consistently improves reasoning performance, with particularly notable gains on long videos and reasoning-intensive questions.

