CV CL LG Apr 8, 2019

Referring to Objects in Videos using Spatio-Temporal Identifying Descriptions

arXiv:1904.03885v1 1092 citations
Originality: Incremental advance
AI Analysis

This work addresses object reference in videos, a problem that spans computer vision and natural language processing. It is an incremental advance: a new method applied to a known bottleneck.

The paper tackles the problem of grounding spatio-temporal identifying descriptions in videos by introducing a new data collection scheme and a two-stream modular attention network that uses both appearance and motion. The results show that motion modules improve the grounding of motion-related words and also improve learning in the appearance modules, because the modular design reduces task interference.

This paper presents a new task, the grounding of spatio-temporal identifying descriptions in videos. Previous work suggests potential bias in existing datasets and emphasizes the need for a new data creation schema to better model linguistic structure. We introduce a new data collection scheme based on grammatical constraints for surface realization to enable us to investigate the problem of grounding spatio-temporal identifying descriptions in videos. We then propose a two-stream modular attention network that learns and grounds spatio-temporal identifying descriptions based on appearance and motion. We show that motion modules help to ground motion-related words and also help learning in appearance modules, because modular neural networks resolve task interference between modules. Finally, we propose a future challenge and a need for a robust system arising from replacing ground truth visual annotations with an automatic video object detector and temporal event localization.
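The abstract does not give the network's equations, so the following is only a minimal sketch of how a two-stream modular attention scorer could be organized, not the authors' implementation. It assumes that each module (appearance, motion) attends over the words of the description with its own weights, scores every candidate object against its own feature stream, and that the two module scores are mixed with softmax weights. All names (`attend`, `ground`, `app_logits`, `mot_logits`, `module_logits`), the dot-product scoring, and the toy vectors are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(word_vecs, word_logits):
    """Attention-weighted sum of word vectors for one module (assumed form)."""
    weights = softmax(word_logits)
    dim = len(word_vecs[0])
    return [sum(w * v[i] for w, v in zip(weights, word_vecs)) for i in range(dim)]


def ground(word_vecs, app_logits, mot_logits, module_logits, candidates):
    """Score each candidate object and return (best_index, scores).

    candidates: list of dicts with 'appearance' and 'motion' feature vectors.
    """
    app_query = attend(word_vecs, app_logits)  # what the appearance module looks for
    mot_query = attend(word_vecs, mot_logits)  # what the motion module looks for
    w_app, w_mot = softmax(module_logits)      # mixing weights for the two modules
    scores = [
        w_app * dot(app_query, c["appearance"]) + w_mot * dot(mot_query, c["motion"])
        for c in candidates
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy description with two words, e.g. "red" and "turning" (made-up 2-d embeddings).
word_vecs = [[1.0, 0.0], [0.0, 1.0]]
app_logits = [2.0, 0.0]   # appearance module attends mostly to "red"
mot_logits = [0.0, 2.0]   # motion module attends mostly to "turning"
module_logits = [0.0, 0.0]

# Two candidates that look the same; only the second one is turning.
candidates = [
    {"appearance": [1.0, 0.0], "motion": [0.0, 0.0]},
    {"appearance": [1.0, 0.0], "motion": [0.0, 1.0]},
]

best, scores = ground(word_vecs, app_logits, mot_logits, module_logits, candidates)
print(best, scores)
```

In this toy setup the two candidates are indistinguishable to the appearance stream, so only the motion stream separates them. That is the situation a spatio-temporal identifying description is meant to resolve.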

