CV · Nov 1, 2021

Gradient Frequency Modulation for Visually Explaining Video Understanding Models

arXiv:2111.01215v2 · 2 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for better explainability in video understanding models, particularly for applications requiring interpretable AI decisions, though it is incremental as it builds on existing perturbation methods.

The paper tackles an often-overlooked problem: generating spatiotemporally consistent visual explanations for video action recognition models. It proposes Frequency-based Extremal Perturbation (F-EP), which modulates the frequencies of gradient maps with a Discrete Cosine Transform (DCT) to reduce noise and improve smoothness. The authors report that the resulting explanations are more faithful than those of state-of-the-art methods.

In many applications, it is essential to understand why a machine learning model makes the decisions it does, but this is inhibited by the black-box nature of state-of-the-art neural networks. Because of this, increasing attention has been paid to explainability in deep learning, including in the area of video understanding. Due to the temporal dimension of video data, the main challenge of explaining a video action recognition model is to produce spatiotemporally consistent visual explanations, which has been ignored in the existing literature. In this paper, we propose Frequency-based Extremal Perturbation (F-EP) to explain a video understanding model's decisions. Because the explanations given by perturbation methods are noisy and non-smooth both spatially and temporally, we propose to modulate the frequencies of gradient maps from the neural network model with a Discrete Cosine Transform (DCT). We show in a range of experiments that F-EP provides more spatiotemporally consistent explanations that more faithfully represent the model's decisions compared to the existing state-of-the-art methods.
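To make the idea of modulating gradient frequencies with a DCT concrete, here is a minimal sketch in NumPy. It is not the authors' implementation: the abstract does not specify how frequencies are modulated, so the hard low-pass mask, the per-axis `keep` fractions, and the function names `dct_matrix` and `lowpass_gradient` are all assumptions made for illustration. The sketch takes a gradient map of shape (T, H, W), transforms each axis with an orthonormal DCT-II, zeroes the high-frequency coefficients, and transforms back.

```python
import numpy as np


def dct_matrix(n):
    """Return the orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # DC row has its own normalisation
    return m


def lowpass_gradient(grad, keep=(0.5, 0.25, 0.25)):
    """Low-pass a (T, H, W) gradient map along every axis.

    `keep` is a hypothetical parameter: the fraction of lowest DCT
    frequencies retained on each axis. A hard cutoff is an assumption;
    the paper may weight frequencies differently.
    """
    out = np.asarray(grad, dtype=float)
    for axis, frac in enumerate(keep):
        n = out.shape[axis]
        d = dct_matrix(n)
        cutoff = max(1, int(round(frac * n)))  # always keep the DC term
        mask = (np.arange(n) < cutoff).astype(float)
        # D^T diag(mask) D: forward DCT, drop high frequencies, inverse DCT.
        filt = d.T @ (mask[:, None] * d)
        # Apply the filter along `axis`, then restore the axis order.
        out = np.moveaxis(np.tensordot(filt, out, axes=([1], [axis])), 0, axis)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy_grad = rng.standard_normal((8, 16, 16))  # stand-in for a real gradient
    smooth_grad = lowpass_gradient(noisy_grad)
    print(smooth_grad.shape)
```

In a perturbation-based explainer, a filter like this would be applied to the gradient of the mask at each optimisation step, so that updates vary smoothly across neighbouring pixels and consecutive frames. Because the filter is an orthogonal projection, it never increases the energy of the gradient and leaves a constant gradient unchanged.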

