CV · LG · NE · Nov 19, 2015

Delving Deeper into Convolutional Networks for Learning Video Representations

arXiv:1511.06432v4 · 789 citations
Originality: Incremental advance
AI Analysis

This paper addresses video representation learning for tasks such as action recognition and captioning. It is rated incremental because it builds on existing methods with modifications.

The authors tackled learning spatio-temporal features in videos by using GRUs with percepts from a deep convolutional network, achieving results equivalent to state-of-the-art on the YouTube2Text dataset without extra 3D CNN features.

We propose an approach to learn spatio-temporal features in videos from intermediate visual representations we call "percepts" using Gated-Recurrent-Unit Recurrent Networks (GRUs). Our method relies on percepts that are extracted from all levels of a deep convolutional network trained on the large ImageNet dataset. While high-level percepts contain highly discriminative information, they tend to have a low spatial resolution. Low-level percepts, on the other hand, preserve a higher spatial resolution from which we can model finer motion patterns. Using low-level percepts can lead to high-dimensional video representations. To mitigate this effect and control the number of model parameters, we introduce a variant of the GRU model that leverages convolution operations to enforce sparse connectivity of the model units and share parameters across the input spatial locations. We empirically validate our approach on both Human Action Recognition and Video Captioning tasks. In particular, we achieve results equivalent to the state of the art on the YouTube2Text dataset using a simpler text-decoder model and without extra 3D CNN features.
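
To make the parameter argument in the abstract concrete, the following is a rough sketch, not the authors' code, that counts weights for a fully connected GRU applied to a flattened percept map against a convolutional GRU that shares one kernel across all spatial locations. It assumes a standard GRU with three input-to-hidden and three hidden-to-hidden weight sets (update gate, reset gate, candidate state) and ignores biases. The function names and the example sizes (a 14x14 map, 512 input channels, 256 hidden channels, 3x3 kernels) are hypothetical and chosen only for illustration.

```python
def fc_gru_params(height, width, c_in, c_hidden):
    """Weights in a fully connected GRU over a flattened feature map.

    Assumption: three gates/candidates, each with an input matrix and a
    recurrent matrix; biases are ignored.
    """
    n_in = height * width * c_in          # flattened input size
    n_hidden = height * width * c_hidden  # flattened hidden-state size
    return 3 * (n_in * n_hidden + n_hidden * n_hidden)


def conv_gru_params(kernel, c_in, c_hidden):
    """Weights in a convolutional GRU with square kernel x kernel filters.

    The count does not depend on the spatial size of the map, because the
    same kernels are reused at every location.
    """
    return 3 * kernel * kernel * (c_in + c_hidden) * c_hidden


if __name__ == "__main__":
    # Hypothetical sizes, not taken from the paper.
    fc = fc_gru_params(height=14, width=14, c_in=512, c_hidden=256)
    conv = conv_gru_params(kernel=3, c_in=512, c_hidden=256)
    print(f"fully connected GRU weights: {fc:,}")
    print(f"convolutional GRU weights:   {conv:,}")
    print(f"reduction factor:            {fc / conv:,.0f}x")
```

Under these assumptions the convolutional variant's weight count stays fixed as the spatial resolution grows, which is why it suits high-resolution low-level percepts.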

Code Implementations: 6 repos
