CL · AI · SD · AS · May 19, 2025

Improving endpoint detection in end-to-end streaming ASR for conversational speech

arXiv:2505.17070v1 · h-index: 15
Originality: Incremental advance
AI Analysis

This work addresses endpoint detection in conversational ASR systems, which is crucial to the user experience of products such as voice assistants. It appears incremental, however, because it builds on existing transducer-based methods.

The paper tackled the problem of delayed emission and endpoint detection errors in transducer-based streaming ASR for conversational speech, proposing methods that improved performance on the Switchboard corpus compared to a baseline delay penalty approach.

ASR endpointing (EP) plays a major role in delivering a good user experience in products supporting human or artificial agents in human-human/machine conversations. Transducer-based ASR (T-ASR) is an end-to-end (E2E) ASR modelling technique preferred for streaming. A major limitation of T-ASR is delayed emission of ASR outputs, which could lead to errors or delays in EP. Inaccurate EP will cut the user off while speaking, returning an incomplete transcript, while delays in EP will increase the perceived latency, degrading the user experience. We propose methods to improve EP by addressing delayed emission along with EP mistakes. To address the delayed emission problem, we introduce an end-of-word token at the end of each word, along with a delay penalty. The EP delay is addressed by obtaining reliable frame-level speech activity detection using an auxiliary network. We apply the proposed methods to the Switchboard conversational speech corpus and evaluate them against a delay penalty method.
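The two ideas in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It only shows (1) appending an end-of-word token after each word of a training transcript and (2) declaring an endpoint from frame-level speech activity scores. The token name `<eow>`, the function names, the 0.5 threshold, and the trailing-silence rule are all assumptions made for illustration. In the paper the speech activity scores come from an auxiliary network, which is replaced here by a plain list of probabilities.

```python
from typing import List, Optional

EOW = "<eow>"  # hypothetical token name; the paper's actual symbol is not stated here


def add_end_of_word_tokens(transcript: str, eow: str = EOW) -> List[str]:
    """Append an end-of-word token after every whitespace-separated word."""
    tokens: List[str] = []
    for word in transcript.split():
        tokens.extend([word, eow])
    return tokens


def detect_endpoint(
    speech_probs: List[float],
    threshold: float = 0.5,         # assumed decision threshold
    min_trailing_silence: int = 3,  # assumed number of silent frames required
) -> Optional[int]:
    """Return the frame index at which an endpoint is declared, or None.

    An endpoint fires once `min_trailing_silence` consecutive non-speech
    frames follow at least one speech frame.
    """
    seen_speech = False
    silence_run = 0
    for i, p in enumerate(speech_probs):
        if p >= threshold:
            # Speech frame: remember that speech occurred, reset the silence counter.
            seen_speech = True
            silence_run = 0
        elif seen_speech:
            # Non-speech frame after speech: extend the trailing-silence run.
            silence_run += 1
            if silence_run >= min_trailing_silence:
                return i
    return None


if __name__ == "__main__":
    print(add_end_of_word_tokens("yeah i know"))
    print(detect_endpoint([0.1, 0.9, 0.8, 0.2, 0.1, 0.1, 0.9]))
```

In this toy rule, a larger `min_trailing_silence` lowers the risk of cutting the speaker off mid-utterance but increases perceived latency, which is the trade-off the abstract describes.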

