CV

Mar 7, 2017

Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos

arXiv:1703.02521v2

57 citations

Originality: Incremental advance
AI Analysis

This work addresses reference resolution in instructional videos. The advance is incremental: it builds on existing models by adding a hybrid visual-linguistic approach.

The paper tackles unsupervised reference resolution in instructional videos: linking each entity to the action that produced it and resolving visual-linguistic ambiguities without supervision. Using more than two thousand unstructured cooking videos, it shows that a joint visual-linguistic model improves upon state-of-the-art linguistic-only models.

We propose an unsupervised method for reference resolution in instructional videos, where the goal is to temporally link an entity (e.g., "dressing") to the action (e.g., "mix yogurt") that produced it. The key challenge is the inevitable visual-linguistic ambiguity arising from changes in both the visual appearance and the referring expression of an entity over the course of the video. This challenge is amplified by the fact that we aim to resolve references with no supervision. We address these challenges by learning a joint visual-linguistic model, where linguistic cues can help resolve visual ambiguities and vice versa. We verify our approach by training our model without supervision on more than two thousand unstructured cooking videos from YouTube, and show that our visual-linguistic model substantially improves upon the state-of-the-art linguistic-only model on reference resolution in instructional videos.
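The linking task described in the abstract can be illustrated with a toy sketch. This is not the paper's model: the weighted sum, the `Action` class, the `resolve_reference` function, and all scores below are hypothetical, chosen only to show how a visual cue might break a tie that linguistic cues alone leave open.

```python
from dataclasses import dataclass


@dataclass
class Action:
    step: int   # temporal position of the action in the video
    text: str   # action description, e.g. "mix yogurt"


def resolve_reference(entity_step, actions, ling_score, vis_score, weight=0.5):
    """Link an entity mention to the earlier action with the best joint score.

    ling_score and vis_score map an action's step to a made-up compatibility
    score for the entity. The weighted sum is an illustrative assumption,
    not the formulation used in the paper.
    """
    # Only actions that happen before the mention can have produced the entity.
    candidates = [a for a in actions if a.step < entity_step]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda a: weight * ling_score[a.step] + (1 - weight) * vis_score[a.step],
    )


# Hypothetical example: the entity "dressing" is mentioned at step 2.
actions = [Action(0, "mix yogurt"), Action(1, "chop lettuce")]
ling = {0: 0.5, 1: 0.5}   # language alone cannot tell the two actions apart
vis = {0: 0.9, 1: 0.2}    # visual similarity favours the yogurt mixture
origin = resolve_reference(2, actions, ling, vis)
print(origin.text)
```

In this toy setting the linguistic scores are tied, so the visual scores decide the link, which mirrors the abstract's point that each modality can help disambiguate the other.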

