AS SD Jan 24, 2022

Endpoint Detection for Streaming End-to-End Multi-talker ASR

arXiv:2201.09979v1 (22 citations)
AI Analysis

The paper addresses the need for prompt system responses in real applications such as conversations and meetings, but its contribution is incremental: it adapts existing single-talker methods to a multi-talker context.

The paper tackles endpoint detection for streaming multi-talker speech recognition by extending the SURT model with an end-of-sentence token and a latency penalty, achieving promising detection without significant accuracy loss on the 2-speaker LibrispeechMix dataset.

Streaming end-to-end multi-talker speech recognition aims to transcribe overlapped speech from conversations or meetings with an all-neural model in a streaming fashion. This is fundamentally different from a modular approach, which usually cascades speech separation and speech recognition models that are trained independently. Previously, we proposed the Streaming Unmixing and Recognition Transducer (SURT) model, based on the recurrent neural network transducer (RNN-T), for this problem and presented promising results. However, in real applications the speech recognition system must also determine the timestamp at which a speaker finishes speaking, so that the system can respond promptly. This problem, known as endpoint (EP) detection, has not been studied previously for multi-talker end-to-end models. In this work, we address the EP detection problem in the SURT framework by introducing an end-of-sentence token as an output unit, following the practice of single-talker end-to-end models. Furthermore, we present a latency penalty approach that can significantly cut down the EP detection latency. Our experimental results on the 2-speaker LibrispeechMix dataset show that the SURT model can achieve promising EP detection without significant degradation of the recognition accuracy.
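
As a rough illustration of the idea in the abstract, the sketch below treats the emission of an end-of-sentence token as the endpoint decision and measures EP latency against a reference end time. It is not the paper's implementation: the token spelling `<eos>`, the 40 ms frame shift, the `Emission` record, the helper names, and the one-output-channel-per-speaker layout are all assumptions made for the example, and the latency penalty used in training is not modelled here.

```python
from dataclasses import dataclass
from typing import List, Optional

EOS = "<eos>"        # hypothetical spelling of the end-of-sentence token
FRAME_SHIFT_MS = 40  # assumed frame shift; not taken from the paper


@dataclass
class Emission:
    token: str
    frame: int  # index of the frame at which the token was emitted


def detect_endpoint_ms(emissions: List[Emission]) -> Optional[int]:
    """Return the time in ms of the first <eos> emission, or None if absent."""
    for e in emissions:
        if e.token == EOS:
            return e.frame * FRAME_SHIFT_MS
    return None


def ep_latency_ms(emissions: List[Emission], reference_end_ms: int) -> Optional[int]:
    """EP latency: detected endpoint time minus the reference end-of-speech time."""
    ep = detect_endpoint_ms(emissions)
    return None if ep is None else ep - reference_end_ms


# Illustrative (made-up) emissions for a 2-speaker mixture,
# assuming one output channel per speaker.
channel_1 = [Emission("hello", 10), Emission("world", 22), Emission(EOS, 30)]
channel_2 = [Emission("good", 18), Emission("morning", 35)]  # no <eos> emitted yet

print(ep_latency_ms(channel_1, reference_end_ms=1000))
print(ep_latency_ms(channel_2, reference_end_ms=1500))
```

In this toy setup a smaller latency value means the `<eos>` token was emitted sooner after the speaker actually stopped, which is the quantity a latency penalty would aim to reduce.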

