CV · Dec 29, 2025

Active Perception Agent for Omnimodal Audio-Video Understanding

arXiv:2512.23646v2 · 5 citations · h-index: 6 · Has Code
Originality: Highly original
AI Analysis

This work addresses challenges in audio-video understanding for AI systems and represents a paradigm shift rather than an incremental improvement.

The paper tackles fine-grained cross-modal understanding and multimodal alignment in omnimodal audio-video models. It introduces OmniAgent, an active perception agent that dynamically orchestrates unimodal tools and, without any training, achieves state-of-the-art performance with accuracy gains of 10% to 20% on three audio-video understanding benchmarks.

Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often struggle with fine-grained cross-modal understanding and multimodal alignment. To address these limitations, we introduce OmniAgent, which is, to the best of our knowledge, the first fully active perception agent that dynamically orchestrates specialized unimodal tools to achieve more fine-grained omnimodal reasoning. Unlike previous works that rely on rigid, static workflows and dense frame-captioning, we demonstrate a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and closed-source models by substantial margins of 10% to 20% in accuracy without training.
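The coarse-to-fine, audio-guided idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `AudioEvent`, `coarse_audio_pass`, `fine_visual_pass`, and `active_perception` are hypothetical names, simple keyword matching stands in for the planner and the audio tools, and the `inspect` callable stands in for whatever visual tool an agent might invoke. The sketch only shows the control flow: audio cues first narrow the timeline, and the more expensive visual step then runs on the localized spans alone.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


@dataclass
class AudioEvent:
    """A labeled audio event, as a hypothetical audio tool might return it."""
    label: str
    span: Segment


def coarse_audio_pass(events: List[AudioEvent], query: str) -> List[Segment]:
    """Coarse step: keep spans whose audio label shares a word with the query.

    Keyword overlap is an assumption made for illustration only.
    """
    query_words = set(query.lower().split())
    return [e.span for e in events if query_words & set(e.label.lower().split())]


def fine_visual_pass(span: Segment, inspect: Callable[[Segment], str]) -> str:
    """Fine step: run the (hypothetical) visual tool on one localized span."""
    return inspect(span)


def active_perception(
    query: str,
    events: List[AudioEvent],
    inspect: Callable[[Segment], str],
    max_calls: int = 3,
) -> List[Tuple[Segment, str]]:
    """Localize with audio, then inspect only the selected spans."""
    spans = coarse_audio_pass(events, query)
    # Cap the number of tool calls so perception stays focused on few spans.
    return [(span, fine_visual_pass(span, inspect)) for span in spans[:max_calls]]
```

In this sketch the visual tool is never called on spans the audio pass rejects, which is the point of the coarse-to-fine ordering; a real agent would replace both stand-ins with model-driven tools and a planner that decides which tool to call next.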

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
