AI | CV | Feb 5

OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention

Zhangquan Chen, Jiale Tao, Ruihuang Li, Yihao Hu, Ruitao Chen, Zhantao Yang, Xinlei Yu, Haodong Jing, Manyuan Zhang, Shuai Shao, Biao Wang, Qinglin Lu

arXiv:2602.05847v212.88 citations | h-index: 8

Originality: Incremental advance

AI Analysis

This work aims to improve mixed-modality reasoning in omnivideo models, an incremental advance for researchers working on audio-visual understanding.

The paper addresses the difficulty omnivideo models have with audio-visual understanding by proposing OmniVideo-R1, a reinforced framework. The framework combines query-intensive grounding, based on self-supervised learning, with modality-attentive fusion, built on contrastive learning. The authors report that it consistently outperforms strong baselines on multiple benchmarks.

While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges on audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to "think with omnimodal cues" by two key strategies: (1) query-intensive grounding based on self-supervised learning paradigms; and (2) modality-attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.
