CV · Apr 26, 2025

3DPyranet Features Fusion for Spatio-temporal Feature Learning

arXiv:2504.18977v1 · h-index: 32
Originality: Incremental advance
AI Analysis

This work addresses video recognition challenges, such as human action and dynamic scene analysis, with a novel method that reduces computational costs, though it is incremental in improving existing deep learning approaches.

The paper tackles the problem of high parameter counts in deep CNNs for video analysis by proposing 3DPyraNet, a 3D pyramidal neural network with a new weighting scheme for spatio-temporal feature learning, and 3DPyraNet-F, which fuses features for classification, achieving state-of-the-art performance on three benchmark datasets and comparable results on a fourth.
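To make the parameter-count concern concrete, the following minimal sketch compares a fully connected layer, a single 3D convolutional layer, and a small stack of 3D convolutional layers. The clip size, kernel size, and channel widths are hypothetical values chosen for illustration; they are not taken from the paper, and the sketch uses standard shared-kernel convolution, not the 3DPyraNet weighting scheme.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def conv3d_params(k_t, k_h, k_w, c_in, c_out):
    """Weights plus biases of a standard 3D convolution with shared kernels."""
    return k_t * k_h * k_w * c_in * c_out + c_out


# Hypothetical input: 16 greyscale frames of 64 x 64 pixels.
n_voxels = 16 * 64 * 64

# Fully connected layer mapping the clip to an output of the same size.
fc = dense_params(n_voxels, n_voxels)

# One 3 x 3 x 3 convolution producing 8 output maps.
single_conv = conv3d_params(3, 3, 3, 1, 8)

# A deeper stack: parameters grow quickly as layers and kernels are added.
channels = [1, 64, 128, 256]  # assumed channel widths
deep_conv = sum(
    conv3d_params(3, 3, 3, c_in, c_out)
    for c_in, c_out in zip(channels, channels[1:])
)

print(f"fully connected: {fc:,}")
print(f"single conv3d:   {single_conv:,}")
print(f"3-layer conv3d:  {deep_conv:,}")
```

Under these assumptions a single convolutional layer needs only a few hundred parameters, while three stacked layers already need over a million, which is the growth the paper sets out to limit.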

A convolutional neural network (CNN) slides a kernel over the whole image to produce an output map. This kernel scheme reduces the number of parameters with respect to a fully connected neural network (NN). While the CNN has proven to be an effective model for recognizing handwritten characters, traffic signs, and similar targets, its deep variants have recently proven effective in similar as well as more challenging applications such as object, scene, and action recognition. Deep CNNs add more layers and kernels to the classical CNN, increasing the number of parameters and partly eroding the main advantage of the CNN, which is having fewer parameters. In this paper, a 3D pyramidal neural network called 3DPyraNet and a discriminative approach for spatio-temporal feature learning based on it, called 3DPyraNet-F, are proposed. 3DPyraNet introduces a new weighting scheme that learns features from both the spatial and temporal dimensions by analyzing multiple adjacent frames while keeping a biologically plausible structure. It preserves the spatial topology of the input image and has fewer parameters and lower computational and memory costs than both fully connected NNs and recent deep CNNs. 3DPyraNet-F extracts the feature maps of the highest layer of the learned network, fuses them into a single vector, and provides that vector as input to a linear SVM classifier, which enhances the recognition of human actions and dynamic scenes in videos. Encouraging results are reported with 3DPyraNet in real-world environments, especially in the presence of camera-induced motion. Further, 3DPyraNet-F clearly outperforms the state of the art on three benchmark datasets and shows comparable results on the fourth.
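The fusion step described in the abstract can be sketched as follows: each feature map from the top layer is flattened, and the results are concatenated into one vector per video. The function name `fuse_feature_maps`, the map shapes, and the L2 normalisation are assumptions made for illustration; the abstract does not specify how the fused vector is scaled, and random arrays stand in for real network outputs.

```python
import numpy as np


def fuse_feature_maps(feature_maps):
    """Flatten each top-layer map and concatenate them into a single vector.

    The L2 normalisation is an assumption for this sketch, not a detail
    stated in the abstract.
    """
    flat = [np.asarray(m, dtype=np.float64).ravel() for m in feature_maps]
    fused = np.concatenate(flat)
    norm = np.linalg.norm(fused)
    return fused / norm if norm > 0 else fused


# Hypothetical top-layer output: 3 maps, each of shape (frames, height, width).
rng = np.random.default_rng(0)
maps = [rng.standard_normal((4, 5, 6)) for _ in range(3)]

vec = fuse_feature_maps(maps)
print(vec.shape)  # one fused vector of length 3 * 4 * 5 * 6
```

Stacking one such vector per video gives a feature matrix that could then be passed to a linear SVM, for example scikit-learn's `LinearSVC`, in place of the network's own classification layer.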

