CV | Mar 29, 2025

OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

Peking U
arXiv:2503.22952v1 | 30 citations | h-index: 10 | CVPR
Originality: Incremental advance
AI Analysis

The paper addresses how to assess the real-world interactive capabilities of multi-modal models, a problem relevant to AI researchers and developers. The contribution is incremental, since it builds on existing benchmarks.

The paper tackles the challenge of evaluating multi-modal language models in streaming video contexts by introducing OmniMMI, a benchmark with over 1,121 videos and 2,290 questions, and proposes the M4 framework for efficient streaming inference.

The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models, designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see, listen while generating.
