CV NE IVNov 3, 2017

Convolutional Drift Networks for Video Classification

Dillon Graham, Seyed Hamed Fatemi Langroudi, Christopher Kanan, Dhireesha Kudithipudi

arXiv:1711.01201v13.111 citations

Originality Incremental advance

AI Analysis

This addresses the challenge of building fully trainable systems for video analysis, offering a simpler alternative to complex methods like recurrent neural networks, though it appears incremental as it builds on existing techniques.

The authors tackled video classification by introducing the Convolutional Drift Network (CDN), which combines convolutional neural networks with reservoir computing to efficiently process spatio-temporal data without hand-crafted features, achieving results on-par with state-of-the-art methods on two egocentric video datasets while only training a single feed-forward layer.

Analyzing spatio-temporal data like video is a challenging task that requires processing visual and temporal information effectively. Convolutional Neural Networks have shown promise as baseline fixed feature extractors through transfer learning, a technique that helps minimize the training cost on visual information. Temporal information is often handled using hand-crafted features or Recurrent Neural Networks, but this can be overly specific or prohibitively complex. Building a fully trainable system that can efficiently analyze spatio-temporal data without hand-crafted features or complex training is an open challenge. We present a new neural network architecture to address this challenge, the Convolutional Drift Network (CDN). Our CDN architecture combines the visual feature extraction power of deep Convolutional Neural Networks with the intrinsically efficient temporal processing provided by Reservoir Computing. In this introductory paper on the CDN, we provide a very simple baseline implementation tested on two egocentric (first-person) video activity datasets.We achieve video-level activity classification results on-par with state-of-the art methods. Notably, performance on this complex spatio-temporal task was produced by only training a single feed-forward layer in the CDN.

View on arXiv PDF

Similar