CV · Oct 31, 2023

Object-centric Video Representation for Long-term Action Anticipation

arXiv:2311.00180v1 · 31 citations · h-index: 12
Originality: Incremental advance
AI Analysis

This work addresses the problem of predicting future human-object interactions in videos for applications like robotics and surveillance, presenting an incremental improvement by leveraging pretrained models without finetuning.

The paper tackles long-term action anticipation in videos by building object-centric representations with visual-language pretrained models and object prompts. Evaluations on the Ego4D, 50Salads, and EGTEA Gaze+ benchmarks confirm the method's effectiveness.

This paper focuses on building object-centric representations for long-term action anticipation in videos. Our key motivation is that objects provide important cues to recognize and predict human-object interactions, especially when the predictions are longer term, as an observed "background" object could be used by the human actor in the future. We observe that existing object-based video recognition frameworks either assume the existence of in-domain supervised object detectors or follow a fully weakly-supervised pipeline to infer object locations from action labels. We propose to build object-centric video representations by leveraging visual-language pretrained models. This is achieved by "object prompts", an approach to extract task-specific object-centric representations from general-purpose pretrained models without finetuning. To recognize and predict human-object interactions, we use a Transformer-based neural architecture which allows the "retrieval" of relevant objects for action anticipation at various time scales. We conduct extensive evaluations on the Ego4D, 50Salads, and EGTEA Gaze+ benchmarks. Both quantitative and qualitative results confirm the effectiveness of our proposed method.
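
The "retrieval" of relevant objects described in the abstract can be pictured as attention over object embeddings. The following is only a minimal sketch under stated assumptions, not the authors' architecture: it uses plain single-query scaled dot-product attention, and the function names (`softmax`, `retrieve_objects`), the object names, and the toy three-dimensional embeddings are all hypothetical. In the paper the object representations come from a visual-language pretrained model queried with object prompts; here they are hand-written vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve_objects(query, object_embeddings):
    """Attend from one query vector over a set of object embeddings.

    Returns the attention weights (one per object) and the
    weight-averaged object embedding.
    """
    dim = len(query)
    # Scaled dot-product similarity between the query and each object.
    scores = [
        sum(q * k for q, k in zip(query, emb)) / math.sqrt(dim)
        for emb in object_embeddings
    ]
    weights = softmax(scores)
    # Weighted sum of object embeddings: the "retrieved" representation.
    pooled = [
        sum(w * emb[i] for w, emb in zip(weights, object_embeddings))
        for i in range(dim)
    ]
    return weights, pooled


# Toy data (hypothetical): three observed objects with one-hot embeddings.
object_names = ["knife", "tomato", "bowl"]
object_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# A hypothetical query for a future action, most aligned with "tomato".
query = [0.2, 2.0, 0.1]

weights, pooled = retrieve_objects(query, object_embeddings)
best = object_names[weights.index(max(weights))]
```

In this toy setup the query places most of its attention on the object whose embedding it matches best. A real model would learn the queries and use many attention heads and layers, which this sketch omits.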

Code Implementations: 1 repo
