CV | Jul 14, 2017

Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding

arXiv:1707.04555v1 | 62 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses large-scale video understanding, but its contribution is incremental: it builds on existing temporal modeling methods.

The paper tackles multi-label video recognition on the YouTube-8M dataset by investigating temporal modeling approaches for aggregating frame-level features, achieving 82.75% GAP@20 on the Kaggle Public test set with a 7-layer fast-forward LSTM.

This paper describes our solution for the video recognition task of the Google Cloud and YouTube-8M Video Understanding Challenge, which ranked 3rd. Because the challenge provides pre-extracted visual and audio features instead of the raw videos, we mainly investigate various temporal modeling approaches to aggregate the frame-level features for multi-label video recognition. Our system contains three major components: a two-stream sequence model, a fast-forward sequence model, and temporal residual neural networks. Experimental results on the challenging YouTube-8M dataset demonstrate that our proposed temporal modeling approaches significantly improve on existing temporal modeling approaches in large-scale video recognition tasks. Notably, our fast-forward LSTM with a depth of 7 layers achieves 82.75% in terms of GAP@20 on the Kaggle Public test set.
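For readers unfamiliar with the reported metric, GAP@20 (Global Average Precision over each video's top 20 predictions) can be sketched roughly as below. This is a minimal NumPy illustration, not the challenge's official evaluation code. The function name `gap_at_k` and its arguments are hypothetical, and the sketch assumes recall is normalized by the total number of ground-truth labels across all videos.

```python
import numpy as np


def gap_at_k(predictions, labels, k=20):
    """Sketch of Global Average Precision over the top-k predictions per video.

    predictions: (num_videos, num_classes) array of confidence scores.
    labels:      (num_videos, num_classes) array of 0/1 ground truth.
    Assumption: recall is normalized by the total number of positive labels.
    """
    predictions = np.asarray(predictions, dtype=float)
    labels = np.asarray(labels, dtype=float)
    k = min(k, predictions.shape[1])

    # Keep only each video's k highest-scoring classes.
    top_idx = np.argsort(-predictions, axis=1)[:, :k]
    top_scores = np.take_along_axis(predictions, top_idx, axis=1).ravel()
    top_hits = np.take_along_axis(labels, top_idx, axis=1).ravel()

    # Pool the (score, hit) pairs from all videos and rank them by confidence.
    order = np.argsort(-top_scores, kind="stable")
    hits = top_hits[order]

    total_positives = labels.sum()
    if total_positives == 0:
        return 0.0

    # Precision at each rank, accumulated only at ranks that are true positives.
    precision_at_i = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return float((precision_at_i * hits).sum() / total_positives)


if __name__ == "__main__":
    # Toy example: 2 videos, 3 classes, top-2 predictions per video.
    scores = [[0.9, 0.1, 0.8], [0.2, 0.7, 0.3]]
    truth = [[1, 0, 0], [0, 1, 1]]
    print(gap_at_k(scores, truth, k=2))
```

Because predictions from all videos are pooled into one ranked list before precision is computed, the metric rewards models whose confidence scores are calibrated across videos, not just ranked well within each one.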

Code Implementations: 1 repo
