SD CL AS

Nov 1, 2022

Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems

Shaan Bijwadia, Shuo-yiin Chang, Bo Li, Tara Sainath, Chao Zhang, Yanzhang He

arXiv:2211.00786v1

10.59 citations

h-index: 69

Originality: Incremental advance

AI Analysis

This work addresses the need for faster and more efficient speech systems by integrating endpointing into speech recognition, offering incremental improvements in latency and accuracy for applications like voice search.

The authors replace the usual pair of separate speech recognition and endpointing models with a single unified end-to-end multitask model. On a voice search test set, this reduced median endpoint latency by 120 ms (30.8%) and 90th percentile latency by 170 ms (23.0%) without regressing word error rate (WER). For continuous recognition, WER improved by 10.6% (relative).

Automatic speech recognition (ASR) systems typically rely on an external endpointer (EP) model to identify speech boundaries. In this work, we propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) multitask model, improving EP quality by optionally leveraging information from the ASR audio encoder. We introduce a "switch" connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model. This results in a single E2E model that can be used during inference to perform frame filtering at low cost, and also make high quality end-of-query (EOQ) predictions based on ongoing ASR computation. We present results on a voice search test set showing that, compared to separate single-task models, this approach reduces median endpoint latency by 120 ms (30.8% reduction), and 90th percentile latency by 170 ms (23.0% reduction), without regressing word error rate. For continuous recognition, WER improves by 10.6% (relative).
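
The "switch" connection described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the stand-in encoder and endpointer, the per-example sampling, and the 0.5 switching probability are all assumptions made for illustration. It only shows the routing idea, in which the EP is trained on either raw audio frames or low-level ASR latents so that one model can serve both the cheap frame-filtering path and the EOQ path at inference.

```python
import math
import random


def asr_low_level_encoder(frame):
    # Hypothetical stand-in for the lowest layers of the ASR audio encoder.
    return [math.tanh(x) for x in frame]


def endpointer(features):
    # Hypothetical stand-in EP head: maps one feature vector to a score in (0, 1).
    mean = sum(features) / len(features)
    return 1.0 / (1.0 + math.exp(-mean))


def switch(frame, use_latent):
    # The "switch": feed the EP either the raw audio frame or the ASR latent.
    return asr_low_level_encoder(frame) if use_latent else frame


def training_inputs(frames, rng, p_latent=0.5):
    # Assumption: one path is sampled per training example so the EP
    # learns to consume both kinds of input.
    use_latent = rng.random() < p_latent
    return use_latent, [switch(f, use_latent) for f in frames]


def filter_frames(frames, threshold=0.5):
    # Inference, cheap path: the EP scores raw audio and decides which
    # frames are passed on to the ASR model at all.
    return [f for f in frames if endpointer(switch(f, False)) >= threshold]


def eoq_scores(frames):
    # Inference, latent path: the EP reuses ongoing ASR computation to
    # produce end-of-query scores.
    return [endpointer(switch(f, True)) for f in frames]
```

In this sketch both paths share one feature dimension so the same EP head can consume either input; a real model would need some way to make the two inputs compatible, which the abstract does not detail.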

