CV · Mar 17, 2025

ViSpeak: Visual Instruction Feedback in Streaming Videos

arXiv:2503.12769v127 citations · h-index: 20
Originality: Incremental advance
AI Analysis

This work addresses time-sensitive, interactive video understanding for AI agents. It is an incremental advance: it builds on existing LMMs and adds a new task and dataset.

The paper tackles the challenge of streaming video understanding by introducing a new task called Visual Instruction Feedback, where models learn to extract instructions from visual content to enhance user-agent interactions, and proposes the ViSpeak model that achieves GPT-4o-level performance on benchmarks.

Recent advances in Large Multi-modal Models (LMMs) have focused primarily on offline video understanding. Streaming video understanding, by contrast, poses great challenges to recent models because it is time-sensitive, omni-modal, and interactive. In this work, we aim to extend streaming video understanding from a new perspective and propose a novel task named Visual Instruction Feedback, in which models should be aware of visual content and learn to extract instructions from it. For example, when a user waves at an agent, the agent should recognize the gesture and start a conversation with a greeting. Following instructions given in the visual modality thus greatly enhances user-agent interaction. To facilitate research, we define seven key subtasks highly relevant to the visual modality and collect the ViSpeak-Instruct dataset for training and ViSpeak-Bench for evaluation. Further, we propose the ViSpeak model, a state-of-the-art streaming video understanding LMM with GPT-4o-level performance on various streaming video understanding benchmarks. After finetuning on our ViSpeak-Instruct dataset, ViSpeak acquires a basic visual instruction feedback ability and serves as a solid baseline for future research.

Code Implementations: 1 repo
