CV LG · Oct 20, 2021

Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos

Reuben Tan, Bryan A. Plummer, Kate Saenko, Hailin Jin, Bryan Russell

arXiv:2110.10596v212.130 citations

Originality: Incremental advance

AI Analysis

This work addresses the challenge of automatically linking video narrations to the spatial regions they describe, which could benefit video understanding and retrieval applications, though it builds incrementally on existing multimodal attention methods.

The paper tackles spatial localization of narrated interactions in instructional videos with a self-supervised multilayer cross-modal attention network. The approach outperforms shallow co-attention and full cross-modal attention baselines on a newly collected set of localized interactions in YouCook2, and reaches state-of-the-art results on weakly supervised phrase grounding on Flickr30K.

We introduce the task of spatially localizing narrated interactions in videos. Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations. To achieve this goal, we propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training. We introduce a divided strategy that alternates between computing inter- and intra-modal attention across the visual and natural language modalities, which allows effective training via directly contrasting the two modalities' representations. We demonstrate the effectiveness of our approach by self-training on the HowTo100M instructional video dataset and evaluating on a newly collected dataset of localized described interactions in the YouCook2 dataset. We show that our approach outperforms alternative baselines, including shallow co-attention and full cross-modal attention. We also apply our approach to grounding phrases in images with weak supervision on Flickr30K and show that stacking multiple attention layers is effective and, when combined with a word-to-region loss, achieves state of the art on recall-at-one and pointing hand accuracies.
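The alternating inter- and intra-modal attention and the contrastive objective described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes single-head attention without learned projections, mean pooling, cosine similarity, and a softmax-style contrastive loss, and the names `attend`, `divided_attention_layer`, `pair_score`, `contrastive_loss`, and `localize` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    """Single-head scaled dot-product attention with a residual connection.
    Assumption: no learned projections, to keep the sketch short."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d))
    return queries + weights @ keys_values, weights

def divided_attention_layer(regions, words):
    """One layer that alternates inter-modal and intra-modal attention."""
    # Inter-modal step: each modality attends to the other one.
    regions_x, _ = attend(regions, words)
    words_x, _ = attend(words, regions)
    # Intra-modal step: each modality attends to itself.
    regions_out, _ = attend(regions_x, regions_x)
    words_out, _ = attend(words_x, words_x)
    return regions_out, words_out

def pair_score(regions, words, num_layers=2):
    """Cosine similarity of mean-pooled features after stacked layers."""
    for _ in range(num_layers):
        regions, words = divided_attention_layer(regions, words)
    v, t = regions.mean(axis=0), words.mean(axis=0)
    return float(v @ t / (np.linalg.norm(v) * np.linalg.norm(t) + 1e-8))

def contrastive_loss(videos, narrations, temperature=0.1):
    """Matched (video, narration) pairs sit on the diagonal of the score
    matrix; every other narration in the batch acts as a negative."""
    scores = np.array([[pair_score(v, t) for t in narrations] for v in videos])
    probs = softmax(scores / temperature, axis=1)
    return float(-np.mean(np.log(np.diag(probs))))

def localize(regions, words):
    """Index of the region receiving the most word-to-region attention."""
    _, weights = attend(words, regions)  # shape: (num_words, num_regions)
    return int(weights.mean(axis=0).argmax())
```

In this sketch, `regions` is a (num_regions, dim) array of visual features and `words` is a (num_words, dim) array of narration features. The word-to-region attention weights used in `localize` stand in for the spatial grounding map.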
