ASCL Jun 2, 2023

Streaming Speech-to-Confusion Network Speech Recognition

arXiv:2306.03778v1
h-index: 70
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient, low-latency ASR in interactive applications such as voice assistants, though the advance is incremental because it builds on existing neural ASR methods.

The paper tackles low-latency speech recognition for interactive systems by introducing a streaming ASR architecture that outputs confusion networks. Its 1-best results are comparable to an RNN-T system, and the richer hypothesis set enables second-pass rescoring that lowers word error rate by 10-20% on LibriSpeech.

In interactive automatic speech recognition (ASR) systems, low-latency requirements limit the amount of search space that can be explored during decoding, particularly in end-to-end neural ASR. In this paper, we present a novel streaming ASR architecture that outputs a confusion network while maintaining limited latency, as needed for interactive applications. We show that 1-best results of our model are on par with a comparable RNN-T system, while the richer hypothesis set allows second-pass rescoring to achieve 10-20% lower word error rate on the LibriSpeech task. We also show that our model outperforms a strong RNN-T baseline on a far-field voice assistant task.
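
To make the terms in the abstract concrete, the following is a minimal sketch of what a confusion network is and how a second pass can rescore it. It is not the paper's model or decoder: the slot contents, the probabilities, the `toy_lm_score` function, and the `lm_weight` parameter are all hypothetical and chosen only so that the 1-best path and the rescored path differ.

```python
import heapq
import math
from itertools import product

# A confusion network is a sequence of slots. Each slot maps candidate
# words to posterior probabilities; EPS means "no word in this slot".
# All words and probabilities here are invented for illustration.
EPS = "<eps>"
confusion_network = [
    {"wreck": 0.55, "recognize": 0.45},
    {EPS: 0.6, "a": 0.4},
    {"speech": 0.7, "beach": 0.3},
]

def one_best(cn):
    """Take the most probable entry of every slot, then drop epsilons."""
    words = [max(slot, key=slot.get) for slot in cn]
    return [w for w in words if w != EPS]

def n_best(cn, n):
    """Return the n paths with the highest first-pass log-probability.

    Brute-force enumeration: exponential in the number of slots, so this
    is only suitable for toy-sized networks.
    """
    paths = []
    for combo in product(*(slot.items() for slot in cn)):
        logp = sum(math.log(p) for _, p in combo)
        words = [w for w, _ in combo if w != EPS]
        paths.append((logp, words))
    return heapq.nlargest(n, paths, key=lambda item: item[0])

def toy_lm_score(words):
    """Hypothetical second-pass scorer: +1 for each 'plausible' bigram."""
    plausible = {("recognize", "speech")}
    return sum(1.0 for pair in zip(words, words[1:]) if pair in plausible)

def rescore(cn, n=10, lm_weight=1.0):
    """Pick the n-best path that maximizes first-pass + weighted LM score."""
    candidates = n_best(cn, n)
    best = max(candidates, key=lambda c: c[0] + lm_weight * toy_lm_score(c[1]))
    return best[1]

print(one_best(confusion_network))  # ['wreck', 'speech']
print(rescore(confusion_network))   # ['recognize', 'speech']
```

In this toy case the first pass prefers "wreck", but the alternative "recognize" is still present in the network, so the second pass can recover it. A 1-best-only output would have discarded that alternative, which is the sense in which the abstract calls the hypothesis set "richer".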
