Train, Diagnose and Fix: Interpretable Approach for Fine-grained Action Recognition
This work addresses the interpretability problem for researchers and practitioners in fine-grained action recognition, offering a systematic approach to improve model performance through diagnosis and refinement.
The paper tackles the black-box nature of deep learning models in action recognition by proposing a three-stage paradigm of training, interpretable diagnosis, and targeted refinement, resulting in a Multi-stream Residual Temporal Convolutional Network that achieves state-of-the-art performance on the NTU RGB+D benchmark.
Despite the growing discriminative capabilities of modern deep learning methods for recognition tasks, the inner workings of state-of-the-art models remain mostly black boxes. In this paper, we propose a systematic interpretation of the model parameters and hidden representations of Residual Temporal Convolutional Networks (Res-TCN) for action recognition in time-series data. As part of the interpretation analysis, we also propose a Feature Map Decoder, which outputs a representation of the model's hidden variables in the same domain as the input. This analysis allows us to expose the model's characteristic learning patterns in an interpretable way. For example, through the diagnosis analysis, we discovered that our model achieves viewpoint invariance by implicitly learning to rotationally normalize the input to a more discriminative view. Based on the findings from the model interpretation analysis, we propose a targeted refinement technique that can generalize to various other recognition models. The proposed work introduces a three-stage paradigm for model learning: training, interpretable diagnosis, and targeted refinement. We validate our approach on NTU RGB+D, a skeleton-based 3D human action recognition benchmark. We show that the proposed workflow is an effective model learning strategy and that the resulting Multi-stream Residual Temporal Convolutional Network (MS-Res-TCN) achieves state-of-the-art performance on NTU RGB+D.
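To make the idea of rotational normalization concrete, the following is a minimal illustrative sketch and not the method of this paper: the model described above learns such a normalization implicitly, whereas the sketch applies it explicitly and by hand. It assumes a skeleton frame is a list of (x, y, z) joint coordinates with y as the vertical axis. The function names, the toy frame, and the choice of two "shoulder" joint indices are all hypothetical.

```python
import math


def rotate_about_y(joints, theta):
    """Rotate every (x, y, z) joint about the vertical y axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [(x * c + z * s, y, -x * s + z * c) for x, y, z in joints]


def normalize_view(joints, left_idx, right_idx):
    """Rotate the frame so the left-to-right reference vector points along +x.

    left_idx and right_idx are hypothetical indices of two reference joints
    (for example, the shoulders). Only the horizontal (x, z) direction is used.
    """
    lx, _, lz = joints[left_idx]
    rx, _, rz = joints[right_idx]
    # Angle of the reference vector in the horizontal plane.
    theta = math.atan2(rz - lz, rx - lx)
    return rotate_about_y(joints, theta)


# Toy frame with three joints (assumed data, for illustration only).
frame = [(0.0, 1.0, 0.0), (1.0, 1.0, 1.0), (0.5, 0.0, 0.5)]
normalized = normalize_view(frame, left_idx=0, right_idx=1)

# The same pose seen from a different camera angle normalizes to the same result.
other_view = rotate_about_y(frame, 0.7)
normalized_other = normalize_view(other_view, left_idx=0, right_idx=1)
```

In this sketch, `normalized` and `normalized_other` agree up to floating-point error, which is the sense in which a rotational normalization yields viewpoint invariance for rotations about the vertical axis.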