CL · NE · SD · AS · Nov 4, 2019

What does a network layer hear? Analyzing hidden representations of end-to-end ASR through speech synthesis

arXiv:1911.01102v1 · 32 citations
Originality: Incremental advance
AI Analysis

This work introduces an analysis method for understanding layer-wise transformations in end-to-end ASR models. The advance is incremental: it applies speech synthesis as a probe of hidden representations.

The authors analyze hidden representations in end-to-end speech recognition systems by synthesizing speech from each layer and examining what information is retained. They observe that speaker variability and noise are gradually removed in deeper layers, consistent with prior findings on how deep networks function in speech recognition.

End-to-end speech recognition systems have achieved competitive results compared to traditional systems. However, the complex transformations between layers, applied to highly variable acoustic signals, are hard to analyze. In this paper, we present our ASR probing model, which synthesizes speech from the hidden representations of an end-to-end ASR system to examine the information maintained after each layer's computation. Listening to the synthesized speech, we observe gradual removal of speaker variability and noise as the layers get deeper, which aligns with previous studies on how deep networks function in speech recognition. This paper is the first study to analyze an end-to-end speech recognition model by demonstrating what each layer hears. Speaker verification and speech enhancement measurements on the synthesized speech are also conducted to further confirm our observations.

Code Implementations: 1 repo