CV · AI · May 11

EvoStreaming: Your Offline Video Model Is a Natively Streaming Assistant

Zichen Wen, Boxue Yang, Junlong Ke, Jiajie Huang, Chenfei Liao, Junxi Wang, Xuyang Liu, Linfeng Zhang

arXiv:2605.10343 · 88.91 citations

Predicted impact: top 17% in CV · last 90 days

Originality: Incremental advance

AI Analysis

For developers of video-language models, this work provides a data-efficient method for adapting offline models to real-time streaming interaction, addressing the lack of an interaction policy in existing models.

The paper introduces RealStreamEval, a frame-level evaluation protocol for streaming video assistants that penalizes unnecessary responses. It also proposes EvoStreaming, a self-evolved adaptation framework that uses only 1,000 self-generated samples to improve the overall RealStreamEval score by up to 10.8 points across five VideoLLM backbones, without architectural changes.

Streaming video understanding demands more than watching longer videos: assistants must decide when to speak in real time, balancing responsiveness against verbosity. Yet most video-language models (VideoLLMs) are trained for offline inference, and existing streaming benchmarks externalize this timing decision to the evaluator. We address this gap with RealStreamEval, a frame-level multi-turn evaluation protocol that exposes models to sequential observations and penalizes unnecessary responses. Under this protocol, we observe that strong offline VideoLLMs retain useful visual understanding but lack an interaction policy for deciding when to respond. Motivated by this observation, we propose EvoStreaming, a self-evolved streaming adaptation framework in which the base model itself acts as data generator, relevance annotator, and roll-out policy to synthesize streaming trajectories without external supervision. With only $1{,}000$ self-generated samples ($139\times$ fewer than the leading streaming instruction-tuning approach) and no architectural changes, EvoStreaming consistently improves the overall RealStreamEval score by up to $10.8$ points across five open VideoLLM backbones (Qwen2/2.5/3-VL, InternVL-3.5, MiniCPM-V4.5) while largely preserving offline video performance. These results suggest that data-efficient interaction tuning is a practical path for adapting existing VideoLLMs to streaming assistants.
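The idea of a frame-level protocol that penalizes unnecessary responses can be illustrated with a toy sketch. The abstract does not specify how RealStreamEval is scored, so everything below is an assumption made for illustration: the `streaming_score` function, the `penalty` weight, the text stand-ins for frames, and the three example policies are all hypothetical and are not the paper's metric or code.

```python
from typing import Callable, Optional, Sequence, Set

# Hypothetical policy type: it sees every frame observed so far and
# returns a reply, or None to stay silent at this step.
Policy = Callable[[Sequence[str]], Optional[str]]


def streaming_score(
    frames: Sequence[str],
    policy: Policy,
    should_respond: Set[int],
    penalty: float = 1.0,
) -> float:
    """Toy score (an assumption, not the RealStreamEval metric).

    +1 for each reply at a step where a reply is expected,
    -penalty for each reply at any other step,
    divided by the number of expected replies.
    """
    hits = 0
    unnecessary = 0
    for t in range(len(frames)):
        # The policy only ever sees frames up to and including step t.
        reply = policy(frames[: t + 1])
        if reply is None:
            continue
        if t in should_respond:
            hits += 1
        else:
            unnecessary += 1
    expected = max(len(should_respond), 1)
    return (hits - penalty * unnecessary) / expected


# Text captions stand in for video frames in this sketch.
frames = ["empty road", "empty road", "car appears", "car leaves"]
expected_steps = {2}  # a reply is only wanted when the car appears


def chatty(seen: Sequence[str]) -> Optional[str]:
    return f"I see: {seen[-1]}"  # replies at every frame


def selective(seen: Sequence[str]) -> Optional[str]:
    return "A car appeared." if seen[-1] == "car appears" else None


def silent(seen: Sequence[str]) -> Optional[str]:
    return None  # never replies


chatty_score = streaming_score(frames, chatty, expected_steps)
selective_score = streaming_score(frames, selective, expected_steps)
silent_score = streaming_score(frames, silent, expected_steps)
```

Under these assumptions, a policy that replies at every frame scores below one that never replies, which is the behavior a verbosity penalty is meant to produce. A model with good visual understanding but no sense of when to speak would resemble the `chatty` policy here.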
