CV · Dec 3, 2025

Towards Object-centric Understanding for Instructional Videos

arXiv:2512.03479v1 · 1 citation · h-index: 1
AI Analysis

This work addresses the problem of flexible step order in real-world tasks for assistive AI, representing an incremental advancement in video understanding benchmarks.

The paper tackles the challenge of understanding procedural activities in instructional videos by shifting from action-centric to object-centric reasoning, introducing the Object-IVQA benchmark with 107 videos and 514 question-answer pairs, and proposing an agent framework that achieves substantial improvements over existing models.

Understanding procedural activities is crucial for developing future assistive AI that can reason about complex real-world tasks. Existing action-centric methods struggle with the flexibility of real procedures, where step order varies depending on object states. In this work, we propose to shift the focus to an object-centric paradigm by regarding actions as mechanisms that drive state transitions. To advance this direction, we introduce Object-IVQA, a long-form instructional video benchmark with 107 videos and 514 open-ended question-answer pairs annotated with temporally grounded evidence. The benchmark evaluates four dimensions of object-centric reasoning: state evolution, precondition verification, counterfactual reasoning and mistake recognition. We further propose an agent framework that orchestrates object-centric planning, perception, analysis and generation tools, enabling explicit evidence retrieval and multi-hop reasoning across disjoint segments. Experiments show that existing large vision-language models struggle with object-level recognition and reasoning, whereas our framework achieves substantial improvements.
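The object-centric view described above, in which actions drive transitions between object states and step order is constrained only by those states, can be illustrated with a toy sketch. This is not the paper's framework or data format: the `Action` class, the `apply` and `run` helpers, and the tea-making example are all hypothetical names chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical action model: an action is a state transition on objects."""
    name: str
    preconditions: dict[str, str]  # object -> state required before the action
    effects: dict[str, str]        # object -> state produced by the action


def apply(states: dict[str, str], action: Action) -> dict[str, str]:
    """Check preconditions against current object states, then apply effects."""
    unmet = {
        obj: required
        for obj, required in action.preconditions.items()
        if states.get(obj) != required
    }
    if unmet:
        raise ValueError(f"'{action.name}' violates preconditions: {unmet}")
    # Return a new mapping so earlier states stay available for inspection.
    return {**states, **action.effects}


def run(initial: dict[str, str], actions: list[Action]) -> dict[str, str]:
    """Track object states through a sequence of actions."""
    states = dict(initial)
    for action in actions:
        states = apply(states, action)
    return states


# Illustrative (made-up) procedure: making tea.
initial = {"water": "cold", "cup": "empty"}
boil = Action("boil water", {"water": "cold"}, {"water": "boiled"})
add_bag = Action("add teabag", {"cup": "empty"}, {"cup": "has_teabag"})
pour = Action(
    "pour water",
    {"water": "boiled", "cup": "has_teabag"},
    {"cup": "steeping"},
)
```

In this sketch, `run(initial, [boil, add_bag, pour])` and `run(initial, [add_bag, boil, pour])` reach the same final states, which mirrors the point that step order can vary as long as object states permit it. Calling `run(initial, [pour, boil, add_bag])` raises `ValueError`, a simple stand-in for the precondition-verification and mistake-recognition settings the benchmark evaluates.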

