CV · LG · Sep 18, 2019

Class Feature Pyramids for Video Explanation

arXiv:1909.08611v1 · 19 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for human-understandable explanations in video action recognition, which matters for model transparency and trust in applications such as surveillance and autonomous systems. It is an incremental advance because it builds on existing interpretability techniques.

The paper tackles the problem of interpreting 3D convolutional networks for video action recognition by introducing Class Feature Pyramids, a method that generates visual explanations by identifying class-informative kernels across network depths. The method is demonstrated on six state-of-the-art models and five datasets.

Deep convolutional networks are widely used in video action recognition. 3D convolutions are one prominent approach to deal with the additional time dimension. While 3D convolutions typically lead to higher accuracies, the inner workings of the trained models are more difficult to interpret. We focus on creating human-understandable visual explanations that represent the hierarchical parts of spatio-temporal networks. We introduce Class Feature Pyramids, a method that traverses the entire network structure and incrementally discovers kernels at different network depths that are informative for a specific class. Our method does not depend on the network's architecture or the type of 3D convolutions, supporting grouped and depth-wise convolutions, convolutions in fibers, and convolutions in branches. We demonstrate the method on six state-of-the-art 3D convolution neural networks (CNNs) on three action recognition (Kinetics-400, UCF-101, and HMDB-51) and two egocentric action recognition datasets (EPIC-Kitchens and EGTEA Gaze+).
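The idea of traversing the network from the class prediction back toward the input and keeping only class-informative kernels at each depth can be illustrated with a small sketch. This is not the authors' implementation: the function names, the max-relative threshold rule, the pooled per-kernel activations, and the toy numbers below are all assumptions made for illustration, and a real network would supply its own activations and connection weights.

```python
from typing import Dict, List


def select_informative(scores: List[float], ratio: float) -> List[int]:
    """Return indices whose score is at least `ratio` times the largest score.

    Hypothetical selection rule: nothing is selected if no score is positive.
    """
    top = max(scores)
    if top <= 0:
        return []
    return [i for i, s in enumerate(scores) if s >= ratio * top]


def class_kernel_pyramid(
    class_weights: List[float],
    layer_links: List[List[List[float]]],
    activations: List[List[float]],
    ratio: float = 0.5,
) -> Dict[int, List[int]]:
    """Walk from the deepest layer to the shallowest, keeping informative kernels.

    class_weights[k]     : weight from kernel k of the last layer to the class.
    layer_links[d][k][j] : assumed connection strength from kernel j in layer d
                           to kernel k in layer d + 1.
    activations[d][j]    : assumed pooled activation of kernel j in layer d.
    Returns a mapping from layer depth to the sorted kernel indices kept there.
    """
    depth = len(activations) - 1

    # Deepest layer: score each kernel by class weight times its activation.
    scores = [w * a for w, a in zip(class_weights, activations[depth])]
    pyramid = {depth: select_informative(scores, ratio)}

    # Step back one layer at a time, following only the kernels already kept.
    for d in range(depth - 1, -1, -1):
        kept = set()
        for k in pyramid[d + 1]:
            contrib = [
                layer_links[d][k][j] * activations[d][j]
                for j in range(len(activations[d]))
            ]
            kept.update(select_informative(contrib, ratio))
        pyramid[d] = sorted(kept)
    return pyramid


# Toy two-layer example with made-up numbers.
activations = [[1.0, 0.2, 0.9], [0.5, 2.0]]
layer_links = [[[1.0, 0.1, 0.0], [0.0, 0.5, 1.0]]]
class_weights = [0.1, 1.0]

pyramid = class_kernel_pyramid(class_weights, layer_links, activations)
print(pyramid)
```

In this toy setup only kernel 1 of the deeper layer passes the threshold for the class, and only the shallower kernels feeding it are then examined, which is what keeps the traversal incremental instead of exhaustive.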

Code Implementations: 1 repo
