CVMay 29, 2025

PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling

Xiao Yu, Yan Fang, Xiaojie Jin, Yao Zhao, Yunchao Wei

arXiv:2505.23155v28.42 citationsh-index: 18Has Code

Originality Highly original

AI Analysis

This addresses the need for efficient, real-time multimodal video understanding, which is incremental as it adapts existing parsing tasks to online settings.

The paper tackles the problem of real-time audio-visual event parsing in video streams by introducing an online paradigm, achieving significant performance improvements with fewer parameters compared to state-of-the-art methods on benchmark datasets.

Audio-visual event parsing plays a crucial role in understanding multimodal video content, but existing methods typically rely on offline processing of entire videos with huge model sizes, limiting their real-time applicability. We introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for parsing audio, visual, and audio-visual events by sequentially analyzing incoming video streams. The On-AVEP task necessitates models with two key capabilities: (1) Accurate online inference, to effectively distinguish events with unclear and limited context in online settings, and (2) Real-time efficiency, to balance high performance with computational constraints. To cultivate these, we propose the Predictive Future Modeling (PreFM) framework featured by (a) predictive multimodal future modeling to infer and integrate beneficial future audio-visual cues, thereby enhancing contextual understanding and (b) modality-agnostic robust representation along with focal temporal prioritization to improve precision and generalization. Extensive experiments on the UnAV-100 and LLP datasets show PreFM significantly outperforms state-of-the-art methods by a large margin with significantly fewer parameters, offering an insightful approach for real-time multimodal video understanding. Code is available at https://github.com/XiaoYu-1123/PreFM.

View on arXiv PDF Code

Similar