LG | AI | CL | May 26, 2022

Global Normalization for Streaming Speech Recognition in a Modular Framework

arXiv:2205.13674v1 | 14 citations | h-index: 41
Originality: Highly original
AI Analysis

This work addresses a key bottleneck for real-time applications of streaming speech recognition: the accuracy gap between streaming and non-streaming models. It reports a substantial, specific gain rather than an incremental improvement.

The paper tackles the label bias problem in streaming speech recognition by introducing the Globally Normalized Autoregressive Transducer (GNAT), which reduces the word error rate gap between streaming and non-streaming models by more than 50% on the Librispeech dataset.

We introduce the Globally Normalized Autoregressive Transducer (GNAT) for addressing the label bias problem in streaming speech recognition. Our solution admits a tractable exact computation of the denominator for the sequence-level normalization. Through theoretical and empirical results, we demonstrate that by switching to a globally normalized model, the word error rate gap between streaming and non-streaming speech-recognition models can be greatly reduced (by more than 50% on the Librispeech dataset). This model is developed in a modular framework which encompasses all the common neural speech recognition models. The modularity of this framework enables controlled comparison of modelling choices and creation of new models.
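To make "sequence-level normalization" concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes a fixed-length label sequence and a context limited to the previous label, so that the denominator (the sum of exponentiated scores over all label sequences) can be computed exactly with a forward-style dynamic program. The score layout `scores[t][prev][cur]` and all function names are hypothetical and chosen only for illustration; the abstract itself does not say how GNAT computes its denominator.

```python
import itertools
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(scores, labels):
    """Unnormalized score of one label sequence.

    scores[t][prev][cur] is an arbitrary real-valued score (hypothetical layout);
    label 0 doubles as the start context at t = 0.
    """
    total, prev = 0.0, 0
    for t, cur in enumerate(labels):
        total += scores[t][prev][cur]
        prev = cur
    return total


def log_denominator(scores, num_labels):
    """Exact log-denominator in O(T * V^2) time via dynamic programming.

    alpha[cur] holds the log-sum of the scores of all prefixes ending in `cur`.
    """
    alpha = [scores[0][0][cur] for cur in range(num_labels)]
    for t in range(1, len(scores)):
        alpha = [
            logsumexp([alpha[prev] + scores[t][prev][cur]
                       for prev in range(num_labels)])
            for cur in range(num_labels)
        ]
    return logsumexp(alpha)


def global_log_prob(scores, labels, num_labels):
    """Globally normalized log-probability: path score minus log-denominator."""
    return path_score(scores, labels) - log_denominator(scores, num_labels)


if __name__ == "__main__":
    # Tiny example: 2 labels, 3 steps, hand-picked unnormalized scores.
    toy = [
        [[0.5, -1.0], [0.0, 0.0]],
        [[1.0, 0.2], [-0.3, 0.8]],
        [[0.1, 0.4], [0.9, -0.5]],
    ]
    for y in itertools.product(range(2), repeat=3):
        print(y, math.exp(global_log_prob(toy, y, 2)))
```

Because the scores are normalized once per sequence rather than once per step, individual steps are free to assign uniformly low or high scores. This is the property that a locally normalized model lacks, and that lack is what gives rise to label bias. Enumerating every sequence grows exponentially with length, whereas the dynamic program above stays polynomial under the finite-context assumption.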

Code Implementations: 1 repo
