EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next
This work addresses the need for step-level intent understanding in applications such as intelligent assistants and robotics. The contribution is incremental, since it pairs existing MLLM capabilities with a new benchmark.
The authors tackle fine-grained, step-level intent understanding in egocentric videos, a problem that existing benchmarks overlook. They introduce EgoIntent, a benchmark of 3,014 steps across 15 scenarios that evaluates models on three dimensions: what, why, and next step. The best-performing model achieves an average score of only 33.31, showing that the task remains highly challenging.
Multimodal Large Language Models (MLLMs) have demonstrated remarkable video reasoning capabilities across diverse tasks. However, their ability to understand human intent at a fine-grained level in egocentric videos remains largely unexplored. Existing benchmarks focus primarily on episode-level intent reasoning, overlooking the finer granularity of step-level intent understanding. Yet applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance require understanding not only what a person is doing at each step, but also why and what comes next, in order to provide timely and context-aware support. To this end, we introduce EgoIntent, a step-level intent understanding benchmark for egocentric videos. It comprises 3,014 steps spanning 15 diverse indoor and outdoor daily-life scenarios, and evaluates models on three complementary dimensions: local intent (What), global intent (Why), and next-step plan (Next). Crucially, each clip is truncated immediately before the key outcome of the queried step (e.g., contact or grasp) occurs and contains no frames from subsequent steps, preventing future-frame leakage and enabling a clean evaluation of anticipatory step understanding and next-step planning. We evaluate 15 MLLMs, including both state-of-the-art closed-source and open-source models. Even the best-performing model achieves an average score of only 33.31 across the three intent dimensions, underscoring that step-level intent understanding in egocentric videos remains a highly challenging problem that calls for further investigation.
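The truncation rule and the three-dimension average described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Step` fields, the function names, the optional `margin_s` parameter, and the use of an unweighted mean over What, Why, and Next are all assumptions made for illustration, since the abstract does not specify annotation formats or how the average is computed.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Step:
    """Hypothetical annotation for one queried step (field names are assumed)."""
    start_s: float    # time at which the step begins, in seconds
    outcome_s: float  # time of the key outcome (e.g., contact or grasp)


def truncated_clip_bounds(step: Step, margin_s: float = 0.0) -> tuple[float, float]:
    """Return (start, end) so that the clip ends before the key outcome."""
    end = step.outcome_s - margin_s
    if end <= step.start_s:
        raise ValueError("clip would be empty after truncation")
    return step.start_s, end


def frames_in_clip(frame_times: list[float], step: Step, margin_s: float = 0.0) -> list[float]:
    """Keep only frames inside the truncated clip.

    The end bound is exclusive, so the outcome frame and every later frame
    (including frames from subsequent steps) are dropped.
    """
    start, end = truncated_clip_bounds(step, margin_s)
    return [t for t in frame_times if start <= t < end]


def average_intent_score(what: float, why: float, next_step: float) -> float:
    """Unweighted mean over the three intent dimensions (an assumption)."""
    return mean([what, why, next_step])
```

For example, with a step that starts at 2.0 s and whose outcome occurs at 5.0 s, `frames_in_clip([1.0, 2.0, 3.5, 5.0, 6.0], Step(2.0, 5.0))` keeps only the frames at 2.0 s and 3.5 s. The exclusive end bound is what prevents the outcome frame itself from leaking into the model's input.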