CL, Oct 7, 2019

A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions

arXiv:1910.02930v11020 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for better automatic captioning of instructional videos to improve user experiences, though the advance is incremental, since it builds on prior multimodal approaches.

The study tackled the problem of automatically generating subtask annotations for instructional videos by combining automatic speech recognition (ASR) tokens and visual features, resulting in significantly improved performance compared to using visual features alone.

Instructional videos get high traffic on video sharing platforms, and prior work suggests that providing time-stamped subtask annotations (e.g., "heat the oil in the pan") improves user experiences. However, current automatic annotation methods based on visual features alone perform only slightly better than constant prediction. Taking cues from prior work, we show that we can improve performance significantly by considering automatic speech recognition (ASR) tokens as input. Furthermore, jointly modeling ASR tokens and visual features results in higher performance compared to training individually on either modality. We find that unstated background information is better explained by visual features, whereas fine-grained distinctions (e.g., "add oil" vs. "add olive oil") are disambiguated more easily via ASR tokens.
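
To make the idea of jointly modeling the two modalities concrete, here is a minimal sketch of one simple way to combine them: mean-pool the ASR token embeddings for a video segment, mean-pool the per-frame visual features, and concatenate the two vectors into a single input. This is an illustration only. The abstract does not describe the model architecture, so the fusion strategy, the function names (`mean_pool`, `fuse_segment`), and the toy embedding table are all assumptions, not the authors' method.

```python
from typing import Dict, List


def mean_pool(vectors: List[List[float]]) -> List[float]:
    """Average a list of equal-length vectors; returns [] when the list is empty."""
    if not vectors:
        return []
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def fuse_segment(
    asr_tokens: List[str],
    frame_features: List[List[float]],
    token_embeddings: Dict[str, List[float]],
    embed_dim: int,
    visual_dim: int,
) -> List[float]:
    """Hypothetical fusion: concatenate pooled ASR and pooled visual vectors.

    A missing modality (no known tokens, or no frames) is replaced by zeros,
    so the output always has length embed_dim + visual_dim.
    """
    known = [token_embeddings[t] for t in asr_tokens if t in token_embeddings]
    asr_vec = mean_pool(known) or [0.0] * embed_dim
    visual_vec = mean_pool(frame_features) or [0.0] * visual_dim
    return asr_vec + visual_vec


# Toy data (made up for illustration): 2-d token embeddings, 3-d frame features.
token_embeddings = {"add": [1.0, 0.0], "olive": [0.0, 1.0], "oil": [1.0, 1.0]}
frames = [[0.2, 0.4, 0.6], [0.4, 0.6, 0.8]]

fused = fuse_segment(["add", "olive", "oil"], frames, token_embeddings, 2, 3)
print(len(fused))  # length is embed_dim + visual_dim
```

In this sketch the fused vector would be passed to a caption generator in place of the visual features alone. Under these assumptions, tokens such as "olive" change the ASR half of the vector even when the frames look identical, which mirrors the abstract's observation that fine-grained distinctions are easier to recover from ASR tokens.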

