CV, Jan 4, 2018

Object Referring in Videos with Language and Human Gaze

arXiv:1801.01582v2, 86 citations
Originality: Incremental advance
AI Analysis

This paper addresses more accurate object localization in videos for computer vision applications. It is an incremental advance that extends object referring methods designed for static images to dynamic video.

The paper tackles object referring in videos: localizing a target object from a language description and human gaze. It introduces a new dataset of 30,000 objects in 5,000 stereo videos and a network that integrates appearance, motion, gaze, and spatio-temporal context, which outperforms previous methods.

We investigate the problem of object referring (OR), i.e., localizing a target object in a visual scene given a language description. Humans perceive the world more as continuous video snippets than as static images, and describe objects not only by their appearance, but also by their spatio-temporal context and motion features. Humans also gaze at the object when they issue a referring expression. Existing works on OR mostly focus on static images only, which fall short of providing many such cues. This paper addresses OR in videos with language and human gaze. To that end, we present a new video dataset for OR, with 30,000 objects over 5,000 stereo video sequences annotated with their descriptions and gaze. We further propose a novel network model for OR in videos that integrates appearance, motion, gaze, and spatio-temporal context into one network. Experimental results show that our method effectively utilizes motion cues, human gaze, and spatio-temporal context. Our method outperforms previous OR methods. For the dataset and code, please refer to https://people.ee.ethz.ch/~arunv/ORGaze.html.

