CV · Apr 4

Leveraging Gaze and Set-of-Mark in VLLMs for Human-Object Interaction Anticipation from Egocentric Videos

arXiv:2604.03667 · 59.61 citations · h-index: 37
Predicted impact: top 58% in CV · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses a key problem for intelligent assistive systems by improving anticipation capabilities, though it appears incremental as it builds on existing VLLM methods.

The paper tackles human-object interaction anticipation from egocentric videos by enhancing visual grounding with Set-of-Mark prompting and gaze trajectories, achieving state-of-the-art results on the HD-EPIC dataset.

The ability to anticipate human-object interactions is highly desirable in an intelligent assistive system, both to guide users during daily activities and to understand their short- and long-term goals. Building systems with such capabilities requires addressing several complex challenges. This work addresses the problem of human-object interaction anticipation in egocentric vision using Vision Large Language Models (VLLMs). We tackle key limitations of existing approaches by improving visual grounding through Set-of-Mark prompting and by inferring user intent from the trajectory formed by the user's most recent gaze fixations. To capture the temporal dynamics immediately preceding the interaction, we further introduce a novel inverse exponential sampling strategy for input video frames. Experiments on the egocentric dataset HD-EPIC demonstrate that our method surpasses state-of-the-art approaches on the considered task and show that it is model-agnostic.
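
The abstract does not give the formula behind the inverse exponential sampling strategy. As a rough illustration only, the sketch below assumes one plausible reading: sample frames whose distance from the end of the clip grows exponentially, so that frames are densest just before the interaction. The function name `inverse_exponential_indices`, the `base` parameter, and the rescaling to the clip length are all hypothetical and are not taken from the paper.

```python
def inverse_exponential_indices(num_frames: int, num_samples: int, base: float = 2.0) -> list[int]:
    """Pick frame indices that cluster toward the end of a clip.

    Assumed formulation (not from the paper): the k-th sample lies
    (base**k - 1) steps before the last frame, with the offsets rescaled
    so that the farthest sample lands on frame 0.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if base <= 1.0:
        raise ValueError("base must be greater than 1")

    last = num_frames - 1
    if num_samples == 1:
        return [last]

    # Largest unscaled offset, used to stretch the offsets over the whole clip.
    max_offset = base ** (num_samples - 1) - 1

    indices = set()
    for k in range(num_samples):
        offset = (base ** k - 1) / max_offset * last
        indices.add(last - round(offset))

    # Short clips can map several offsets to the same frame; the set removes duplicates.
    return sorted(indices)


if __name__ == "__main__":
    print(inverse_exponential_indices(num_frames=100, num_samples=5))
```

Under these assumptions, the gap between consecutive sampled frames shrinks toward the end of the clip, so more of the frame budget goes to the moments right before the anticipated interaction. A larger `base` concentrates the samples even more tightly near the end.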
