Spatial-temporal Concept based Explanation of 3D ConvNets
This work addresses the need for interpretability in 3D video analysis. Its contribution is incremental: it extends concept-based explanation methods from the 2D image domain to the 3D video domain.
The paper tackles the problem of explaining 3D video recognition ConvNets, which is less studied than the 2D case because of the computational cost and complexity of video data. It presents a 3D ACE framework that represents videos as high-level supervoxels and uses importance scores to discover spatial-temporal concepts, enabling in-depth exploration of their influence on tasks such as action classification.
Recent studies have achieved outstanding success in explaining 2D image recognition ConvNets. In contrast, because of the computational cost and complexity of video data, the explanation of 3D video recognition ConvNets is relatively less studied. In this paper, we present a 3D ACE (Automatic Concept-based Explanation) framework for interpreting 3D ConvNets. In our approach: (1) videos are represented using high-level supervoxels, which are straightforward for humans to understand; and (2) the interpreting framework estimates a score for each voxel, which reflects its importance in the decision procedure. Experiments show that our method can discover spatial-temporal concepts of different importance levels, and can therefore explore in depth the influence of these concepts on a target task such as action classification. The code is publicly available.
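The abstract does not say how the importance score is computed. As a rough illustration only, the sketch below assumes a TCAV-style rule, as used in the original 2D ACE method: a concept's importance is the fraction of videos whose class score increases when the activations move along the concept's direction. The function names, the toy gradients, and the concept direction are all hypothetical and are not taken from the authors' released code.

```python
from typing import Sequence


def directional_derivative(gradient: Sequence[float],
                           concept_direction: Sequence[float]) -> float:
    """Dot product of a class-score gradient with a concept direction."""
    if len(gradient) != len(concept_direction):
        raise ValueError("gradient and concept direction must have the same length")
    return sum(g * c for g, c in zip(gradient, concept_direction))


def concept_importance(gradients: Sequence[Sequence[float]],
                       concept_direction: Sequence[float]) -> float:
    """Fraction of videos whose class score rises along the concept direction.

    Assumption: this TCAV-style rule stands in for the scoring step; the
    abstract itself does not specify the formula.
    """
    if not gradients:
        raise ValueError("need at least one gradient")
    positive = sum(
        1 for g in gradients
        if directional_derivative(g, concept_direction) > 0
    )
    return positive / len(gradients)


# Hypothetical toy data: one gradient of the class score per video, taken
# with respect to a 2-dimensional activation vector.
toy_gradients = [[0.5, -0.1], [0.2, 0.3], [-0.4, 0.1], [0.1, 0.0]]
# Hypothetical concept direction, e.g. learned from one cluster of supervoxels.
toy_concept = [1.0, 0.0]

score = concept_importance(toy_gradients, toy_concept)
print(f"concept importance: {score:.2f}")
```

In a real pipeline the gradients would come from an intermediate layer of the 3D ConvNet, and the concept direction would be learned from the activations of a cluster of similar supervoxels. Both are replaced by hand-written numbers here to keep the sketch self-contained.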