Spiking Two-Stream Methods with Unsupervised STDP-based Learning for Action Recognition
This work addresses the high computational cost and large labeled-data requirements of video analysis in applications such as surveillance and autonomous vehicles, though its contribution to spiking networks in this domain is incremental.
The paper tackles action recognition in videos using Convolutional Spiking Neural Networks (CSNNs) trained with unsupervised STDP learning. It shows that two-stream CSNNs can extract spatio-temporal information from limited training data and that the spiking spatial and temporal streams are complementary, whereas adding a spatio-temporal stream introduces redundancy without improving performance.
Video analysis is a computer vision task that is useful for many applications, such as surveillance, human-machine interaction, and autonomous vehicles. Deep Convolutional Neural Networks (CNNs) are currently the state-of-the-art methods for video analysis. However, they have high computational costs and need a large amount of labeled data for training. In this paper, we use Convolutional Spiking Neural Networks (CSNNs) trained with the unsupervised Spike Timing-Dependent Plasticity (STDP) learning rule for action classification. These networks represent information using asynchronous, low-energy spikes, which makes them more energy efficient and better suited to neuromorphic hardware. However, the behaviour of CSNNs has not been sufficiently studied with spatio-temporal computer vision models. Therefore, we explore transposing two-stream neural networks into the spiking domain. Implementing this model with unsupervised STDP-based CSNNs allows us to further study the performance of these networks on video analysis. In this work, we show that two-stream CSNNs can successfully extract spatio-temporal information from videos despite using limited training data, and that the spiking spatial and temporal streams are complementary. We also show that using a spatio-temporal stream within a spiking STDP-based two-stream architecture leads to information redundancy and does not improve performance.
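
To illustrate the kind of local, label-free update that STDP refers to, the following is a minimal sketch of a generic textbook pair-based STDP rule. It is not necessarily the rule used in this paper: the function names (`stdp_delta`, `update_weight`), the exponential window, and all parameter values (`a_plus`, `a_minus`, `tau`, and the weight bounds) are assumptions chosen only for illustration.

```python
import math


def stdp_delta(t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Weight change for one pre/post spike pair (times in ms).

    Generic pair-based STDP; all parameter values are illustrative assumptions.
    """
    dt = t_post - t_pre
    if dt >= 0:
        # Pre-synaptic spike arrived before (or with) the post-synaptic spike:
        # strengthen the synapse, more so when the spikes are close in time.
        return a_plus * math.exp(-dt / tau)
    # Post-synaptic spike came first: weaken the synapse.
    return -a_minus * math.exp(dt / tau)


def update_weight(w, t_pre, t_post, w_min=0.0, w_max=1.0):
    """Apply the STDP change and clip the weight to [w_min, w_max]."""
    return min(w_max, max(w_min, w + stdp_delta(t_pre, t_post)))
```

The update depends only on the relative timing of two spikes at a single synapse, so no class labels are involved, which is why such rules are described as unsupervised.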