CL | Nov 15, 2018

Streaming End-to-end Speech Recognition For Mobile Devices

Yanzhang He, Tara N. Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, Qiao Liang, Deepti Bhatia

arXiv:1811.06621v1 | 21.2 | 673 citations | h-index: 70

Originality: Incremental advance

AI Analysis

This work addresses the need for real-time, accurate on-device speech recognition. The advance is incremental: it builds on existing E2E methods and improves their performance.

The authors tackled the challenge of building a streaming end-to-end speech recognizer for mobile devices, achieving lower latency and higher accuracy than a conventional CTC-based model in several evaluation categories.

End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories.
