AS CL SD | Oct 27, 2020

Cascaded encoders for unifying streaming and non-streaming ASR

Arun Narayanan, Tara N. Sainath, Ruoming Pang, Jiahui Yu, Chung-Cheng Chiu, Rohit Prabhavalkar, Ehsan Variani, Trevor Strohman

arXiv:2010.14606v1 | 21.488 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for flexible ASR systems that can handle both real-time and offline processing, and it particularly improves performance on long-form speech. Its contribution is incremental, however, in that it combines existing encoder types.

The paper tackles the problem of building a single end-to-end automatic speech recognition model that can operate in both streaming and non-streaming modes. In streaming mode the model achieves word error rates similar to those of a standalone streaming model, and in non-streaming mode it obtains a 10% to 27% relative improvement.
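To make the "relative improvement" figure concrete, the formula below assumes the usual convention of relative word error rate reduction against a baseline. The summary does not spell out the exact definition or the baseline, so both the formula and the symbol names are assumptions.

```latex
\[
\text{relative improvement}
  = \frac{\mathrm{WER}_{\text{baseline}} - \mathrm{WER}_{\text{new}}}{\mathrm{WER}_{\text{baseline}}}
\]
```

As a purely hypothetical illustration (these numbers are not from the paper), a drop from a WER of 10.0% to 8.0% would be a (10.0 - 8.0) / 10.0 = 20% relative improvement, even though the absolute difference is only 2 percentage points.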

End-to-end (E2E) automatic speech recognition (ASR) models, by now, have shown competitive performance on several benchmarks. These models are structured to operate in either streaming or non-streaming mode. This work presents cascaded encoders for building a single E2E ASR model that can operate in both these modes simultaneously. The proposed model consists of streaming and non-streaming encoders. Input features are first processed by the streaming encoder; the non-streaming encoder operates exclusively on the output of the streaming encoder. A single decoder then learns to decode using the output of either the streaming or the non-streaming encoder. Results show that this model achieves word error rates (WER) similar to those of a standalone streaming model when operating in streaming mode, and obtains 10% to 27% relative improvement when operating in non-streaming mode. Our results also show that the proposed approach outperforms existing E2E two-pass models, especially on long-form speech.
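The data flow described in the abstract can be sketched as a toy example in pure Python. This is only an illustration of the cascade, not the authors' implementation: the function names, the `left_context` parameter, the moving-average "streaming encoder", the mean-centering "non-streaming encoder", and the summing "decoder" are all hypothetical stand-ins chosen so the example runs without any ML library.

```python
from typing import List

Frame = List[float]


def streaming_encoder(features: List[Frame], left_context: int = 2) -> List[Frame]:
    """Toy causal encoder (hypothetical): output t uses only frames <= t."""
    out = []
    for t in range(len(features)):
        window = features[max(0, t - left_context): t + 1]
        # Average each feature dimension over the causal window.
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def non_streaming_encoder(streaming_out: List[Frame]) -> List[Frame]:
    """Toy full-context encoder (hypothetical).

    It consumes only the streaming encoder's output, never the raw
    features, and every output frame depends on the whole utterance.
    """
    n = len(streaming_out)
    mean = [sum(col) / n for col in zip(*streaming_out)]
    return [[x - m for x, m in zip(frame, mean)] for frame in streaming_out]


def shared_decoder(encoded: List[Frame]) -> List[float]:
    """Toy stand-in for the single decoder shared by both modes."""
    return [sum(frame) for frame in encoded]


def recognize(features: List[Frame], mode: str) -> List[float]:
    """Run the cascade in 'streaming' or 'non_streaming' mode."""
    enc = streaming_encoder(features)
    if mode == "non_streaming":
        enc = non_streaming_encoder(enc)  # cascaded on top of the streaming output
    elif mode != "streaming":
        raise ValueError(f"unknown mode: {mode}")
    return shared_decoder(enc)


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(recognize(feats, "streaming"))
    print(recognize(feats, "non_streaming"))
```

The point of the sketch is the wiring: both modes share the streaming encoder and the decoder, and the non-streaming path simply adds a second encoder that is allowed to look at future frames.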
