CL SD AS

Oct 13, 2022

JOIST: A Joint Speech and Text Streaming Model For ASR

Tara N. Sainath, Rohit Prabhavalkar, Ankur Bapna, Yu Zhang, Zhouyuan Huo, Zhehuai Chen, Bo Li, Weiran Wang, Trevor Strohman

arXiv:2210.07353v15.439 citations

h-index: 69

Originality: Incremental advance

AI Analysis

The paper addresses the problem of improving ASR accuracy for real-time applications such as voice assistants. The advance appears incremental: it builds on existing streaming E2E models, adding joint training and more data.

The paper tackles the problem of improving streaming automatic speech recognition (ASR) by jointly training a model with both speech-text paired and text-only unpaired inputs, rather than using pre-training and fine-tuning. The result is a 4-14% relative improvement in word error rate (WER) across various test sets compared to models without text training, while maintaining streaming capabilities.
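To make "relative improvement" concrete, the sketch below computes a relative WER reduction from a baseline WER and a new WER. The function name and the example numbers are hypothetical illustrations, not figures from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a percentage of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, not taken from the paper: a drop from 10.0% to 9.0%
# absolute WER is a 10% relative reduction.
example = relative_wer_reduction(baseline_wer=10.0, new_wer=9.0)
print(f"{example:.1f}% relative WER reduction")
```

Under this definition, a 4-14% relative improvement means the new WER is between 0.96 and 0.86 times the baseline WER.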

We present JOIST, an algorithm to train a streaming, cascaded-encoder end-to-end (E2E) model with both speech-text paired inputs and text-only unpaired inputs. Unlike previous works, we explore joint training with both modalities, rather than pre-training and fine-tuning. In addition, we explore JOIST using a streaming E2E model with an order of magnitude more data, both of which are also novelties compared to previous works. Through a series of ablation studies, we explore different types of text modeling, including how to model the length of the text sequence and the appropriate text sub-word unit representation. We find that the best text representation for JOIST improves WER across a variety of search and rare-word test sets by 4-14% relative, compared to a model not trained with text. In addition, we quantitatively show that JOIST maintains streaming capabilities, which is important for a good user-level experience.
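As a rough illustration of joint training on both modalities, the sketch below combines a paired speech-text loss with an unpaired text-only loss in a single training step, and stretches a text sequence toward speech-like length by repeating tokens. Both choices are assumptions made for illustration: the abstract does not specify the loss combination or the length model, and the names `upsample_by_repetition`, `joint_loss`, and `text_weight` are hypothetical.

```python
from typing import List, Sequence


def upsample_by_repetition(tokens: Sequence[str], repeats: int) -> List[str]:
    """Repeat each text token a fixed number of times.

    Assumption: fixed repetition is one simple way to model text length;
    the abstract does not say which length model JOIST uses.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    return [tok for tok in tokens for _ in range(repeats)]


def joint_loss(paired_loss: float, text_loss: float, text_weight: float) -> float:
    """Weighted sum of the paired and text-only losses.

    Assumption: a weighted sum is a common way to train on two data
    sources at once; it stands in for whatever JOIST actually uses.
    """
    return paired_loss + text_weight * text_loss


# Hypothetical values for one training step.
text_input = upsample_by_repetition(["jo", "ist"], repeats=3)
step_loss = joint_loss(paired_loss=1.0, text_loss=2.0, text_weight=0.5)
print(text_input, step_loss)
```

The point of the sketch is the contrast with pre-training and fine-tuning: both losses contribute to every update, rather than text being used in a separate earlier stage.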
