CV · Feb 9

Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition

arXiv:2602.08439v1 · 4 citations · h-index: 32
Originality: Incremental advance
AI Analysis

This paper addresses the need for video understanding models to adapt from a few examples. The advance is incremental: it builds on existing MLLM capabilities and adds a new benchmark and training strategy.

The paper argues that Multimodal Large Language Models (MLLMs) struggle to learn from dynamic, novel contexts in videos. It introduces demo-driven video in-context learning as a task, Demo-ICL-Bench as the benchmark for it, and Demo-ICL as a model trained for it, and reports that Demo-ICL outperforms state-of-the-art methods on this benchmark.

Despite the growing video understanding capabilities of recent Multimodal Large Language Models (MLLMs), existing video benchmarks primarily assess understanding based on models' static, internal knowledge, rather than their ability to learn and adapt from dynamic, novel contexts from few examples. To bridge this gap, we present Demo-driven Video In-Context Learning, a novel task focused on learning from in-context demonstrations to answer questions about the target videos. Alongside this, we propose Demo-ICL-Bench, a challenging benchmark designed to evaluate demo-driven video in-context learning capabilities. Demo-ICL-Bench is constructed from 1200 instructional YouTube videos with associated questions, from which two types of demonstrations are derived: (i) summarizing video subtitles for text demonstration; and (ii) corresponding instructional videos as video demonstrations. To effectively tackle this new challenge, we develop Demo-ICL, an MLLM with a two-stage training strategy: video-supervised fine-tuning and information-assisted direct preference optimization, jointly enhancing the model's ability to learn from in-context examples. Extensive experiments with state-of-the-art MLLMs confirm the difficulty of Demo-ICL-Bench, demonstrate the effectiveness of Demo-ICL, and thereby unveil future research directions.
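The second training stage named in the abstract builds on direct preference optimization (DPO). As a rough orientation, the sketch below computes the standard DPO loss for one preference pair from sequence log-probabilities. It is not the paper's information-assisted variant, whose details the abstract does not give; the function name `dpo_loss`, its parameter names, and the default `beta=0.1` are assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) response pair.

    All inputs are summed log-probabilities of a full response under the
    trainable policy or the frozen reference model. Names and the default
    beta are illustrative assumptions, not taken from the paper.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # Loss is -log(sigmoid(x)), evaluated in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy prefers the chosen response more than the reference does.
    print(dpo_loss(-4.0, -9.0, -5.0, -7.0))
```

The loss falls below log(2) as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model; `beta` controls how strongly deviations from the reference are weighted.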

