LG, CL | Dec 14, 2020

A review of on-device fully neural end-to-end automatic speech recognition algorithms

Chanwoo Kim, Dhananjaya Gowda, Dongsoo Lee, Jiyeon Kim, Ankur Kumar, Sungsoo Kim, Abhinav Garg, Changwoo Han

arXiv:2012.07974v38.532 citations

Originality: Synthesis-oriented

AI Analysis

This review is significant for researchers and developers working on on-device ASR, as it consolidates information on memory-efficient neural network-based solutions.

This paper reviews fully neural end-to-end automatic speech recognition (ASR) algorithms. Conventional ASR systems usually need a very large language model (up to 100 GB) to reach sufficiently high accuracy, which rules out on-device use; end-to-end models require much smaller memory footprints, making on-device implementation feasible.

In this paper, we review various end-to-end automatic speech recognition algorithms and their optimization techniques for on-device applications. Conventional speech recognition systems comprise a large number of discrete components such as an acoustic model, a language model, a pronunciation model, a text normalizer, an inverse text normalizer, and a decoder based on a Weighted Finite State Transducer (WFST). To obtain sufficiently high speech recognition accuracy with such conventional systems, a very large language model (up to 100 GB) is usually needed. Hence, the corresponding WFST becomes enormous, which prohibits on-device implementation. Recently, fully neural network end-to-end speech recognition algorithms have been proposed. Examples include speech recognition systems based on Connectionist Temporal Classification (CTC), the Recurrent Neural Network Transducer (RNN-T), Attention-based Encoder-Decoder (AED) models, Monotonic Chunk-wise Attention (MoChA), and transformers. These fully neural network-based systems require much smaller memory footprints than conventional algorithms; therefore, their on-device implementation has become feasible. We extensively discuss the structures, performance, and advantages of these end-to-end models compared to conventional algorithms.
