OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
This work aims to improve mixed-modality reasoning in omnivideo models, an incremental advance for researchers working on audio-visual understanding.
The paper addresses challenges in audio-visual understanding for omnivideo models by proposing OmniVideo-R1, a reinforced framework. The framework combines query-intensive grounding based on self-supervised learning with modality-attentive fusion built on contrastive learning, and it is reported to consistently outperform strong baselines on multiple benchmarks.
While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges on audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to "think with omnimodal cues" through two key strategies: (1) query-intensive grounding based on self-supervised learning paradigms; and (2) modality-attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.