LG, AI, CL, HC, SD, AS. Feb 1, 2022

Visualizing Automatic Speech Recognition -- Means for a Better Understanding?

arXiv:2202.00673v1, 11 citations
AI Analysis

This work addresses the interpretability of ASR systems for researchers and practitioners, but it is incremental as it adapts existing visualization techniques to a new domain.

The paper tackles the problem of understanding how automatic speech recognition (ASR) models work. It applies attribution methods from image recognition to audio data, using DeepSpeech as a case study to visualize which input features are most influential.

Automatic speech recognition (ASR) is steadily improving at mimicking human speech processing. The functioning of ASR systems, however, remains to a large extent obscured by the complex structure of the deep neural networks (DNNs) they are based on. In this paper, we show how so-called attribution methods, which we import from image recognition and suitably adapt to handle audio data, can help to clarify the workings of ASR. Taking DeepSpeech, an end-to-end model for ASR, as a case study, we show how these techniques help to visualize which features of the input are the most influential in determining the output. We focus on three visualization techniques: Layer-wise Relevance Propagation (LRP), Saliency Maps, and Shapley Additive Explanations (SHAP). We compare these methods and discuss potential further applications, such as in the detection of adversarial examples.
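
To make the idea of a saliency map concrete, here is a minimal sketch in plain Python. It is not the paper's implementation and does not use DeepSpeech: `toy_score` is a hypothetical stand-in for a single output logit, the input is a made-up 2-frame by 3-bin feature grid, and the gradient is approximated with central finite differences rather than backpropagation. The saliency of each time-frequency cell is the absolute value of the score's derivative with respect to that cell.

```python
import math

def toy_score(frames, weights):
    """Hypothetical stand-in for one ASR output logit: tanh of a weighted sum."""
    total = sum(w * v for frame in frames for w, v in zip(weights, frame))
    return math.tanh(total)

def saliency_map(frames, weights, eps=1e-5):
    """Absolute central-difference gradient of the score w.r.t. each input cell."""
    sal = []
    for frame in frames:
        row = []
        for f in range(len(frame)):
            original = frame[f]
            frame[f] = original + eps
            up = toy_score(frames, weights)
            frame[f] = original - eps
            down = toy_score(frames, weights)
            frame[f] = original  # restore the input before moving on
            row.append(abs(up - down) / (2 * eps))
        sal.append(row)
    return sal

# Assumed toy data: 2 time frames x 3 feature bins, one weight per bin.
frames = [[0.1, 0.2, 0.0], [0.0, 0.1, 0.3]]
weights = [0.5, -2.0, 0.1]
sal = saliency_map(frames, weights)

# The bin with the largest saliency in each frame is the most influential one.
most_influential = [row.index(max(row)) for row in sal]
```

In this toy setup the middle bin carries the largest weight in magnitude, so it receives the highest saliency in every frame. With a real network, the same quantity would normally be obtained from the framework's automatic differentiation and plotted over the spectrogram.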
