CV · Oct 27, 2025

EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT

arXiv:2510.23569v1 · 14 citations · h-index: 14 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses a core challenge in AI for robotics and human-computer interaction: understanding egocentric video. It is an incremental advance, however, because it builds on existing multimodal models and adds new training techniques.

The paper tackles egocentric video reasoning, where a model must infer hidden intentions and fine-grained interactions from a first-person perspective. It introduces EgoThinker, a framework that enhances multimodal large language models with spatio-temporal chain-of-thought supervision. EgoThinker outperforms existing methods on multiple benchmarks and achieves substantial improvements on localization tasks.

Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at visible event reasoning but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a novel framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought (CoT) supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. This dataset features multi-minute segments annotated with detailed CoT rationales and dense hand-object grounding. Second, we employ supervised fine-tuning (SFT) on EgoRe-5M to instill reasoning skills, followed by reinforcement fine-tuning (RFT) to further enhance spatio-temporal localization. Experimental results show that EgoThinker outperforms existing methods across multiple egocentric benchmarks, while achieving substantial improvements in fine-grained spatio-temporal localization tasks. Full code and data are released at https://github.com/InternRobotics/EgoThinker.

