CV CL LG Apr 8, 2019

Referring to Objects in Videos using Spatio-Temporal Identifying Descriptions

Peratham Wiriyathammabhum, Abhinav Shrivastava, Vlad I. Morariu, Larry S. Davis

arXiv:1904.03885v151.01092 citations

Originality: Incremental advance

AI Analysis

This work addresses object reference in videos, a problem that spans computer vision and natural language processing. It is an incremental advance: a new method applied to a known bottleneck.

The paper tackles the problem of grounding spatio-temporal identifying descriptions in videos by introducing a new data collection scheme and a two-stream modular attention network that leverages appearance and motion. The results show that motion modules improve the grounding of motion-related words and also improve learning in the appearance modules, because modular networks reduce task interference.

This paper presents a new task, the grounding of spatio-temporal identifying descriptions in videos. Previous work suggests potential bias in existing datasets and emphasizes the need for a new data creation schema to better model linguistic structure. We introduce a new data collection scheme based on grammatical constraints for surface realization to enable us to investigate the problem of grounding spatio-temporal identifying descriptions in videos. We then propose a two-stream modular attention network that learns and grounds spatio-temporal identifying descriptions based on appearance and motion. We show that motion modules help to ground motion-related words and also help to learn in appearance modules because modular neural networks resolve task interference between modules. Finally, we propose a future challenge and a need for a robust system arising from replacing ground truth visual annotations with automatic video object detection and temporal event localization.
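The two-stream idea in the abstract can be illustrated with a toy sketch. This is not the authors' architecture: the function names, the three-dimensional word embeddings, the hand-set attention queries, and the dot-product scoring are all assumptions made for illustration. Each module attends over the words of the description, pools them into a phrase vector, and scores every candidate object against its own feature stream; the candidate with the highest summed score is taken as the referent.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(word_vecs, query):
    """Soft attention over words: returns the attention-weighted word vector."""
    weights = softmax([dot(w, query) for w in word_vecs])
    dim = len(word_vecs[0])
    return [sum(wt * w[i] for wt, w in zip(weights, word_vecs)) for i in range(dim)]

def ground(word_vecs, candidates, app_query, mot_query):
    """Score each candidate with an appearance module and a motion module.

    Hypothetical scoring rule: the sum of the two module scores.
    Returns (index of the best candidate, list of all scores).
    """
    app_phrase = attend(word_vecs, app_query)  # appearance module's view of the text
    mot_phrase = attend(word_vecs, mot_query)  # motion module's view of the text
    scores = [
        dot(app_phrase, c["appearance"]) + dot(mot_phrase, c["motion"])
        for c in candidates
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy description "red car turning"; axes are (red, car, turning). Assumed values.
words = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hand-set queries standing in for learned attention parameters (assumption).
app_query = [1.0, 1.0, 0.0]  # favors appearance words
mot_query = [0.0, 0.0, 2.0]  # favors motion words

# Two red cars that look identical; only the second one is turning.
candidates = [
    {"appearance": [1.0, 1.0, 0.0], "motion": [0.0, 0.0, 0.0]},
    {"appearance": [1.0, 1.0, 0.0], "motion": [0.0, 0.0, 1.0]},
]

best, scores = ground(words, candidates, app_query, mot_query)
print(best, scores)
```

In this toy setup the two candidates have identical appearance features, so only the motion module can separate them, which mirrors the abstract's claim that motion modules help ground motion-related words.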
