AS · CL · SD · Aug 29, 2022

A Language Agnostic Multilingual Streaming On-Device ASR System

arXiv:2208.13916v1 · 14 citations · h-index: 69
Originality: Incremental advance
AI Analysis

This work addresses the problem of efficient, real-time multilingual speech recognition on mobile devices. It is an incremental improvement that extends the authors' previous capacity solution to streaming applications.

The paper tackles the challenge of building a streaming multilingual automatic speech recognition (ASR) system that runs fully on-device, achieving comparable quality and latency to monolingual models while supporting real-time intersentential code switching.

On-device end-to-end (E2E) models have shown improvements over a conventional model on English Voice Search tasks in both quality and latency. E2E models have also shown promising results for multilingual automatic speech recognition (ASR). In this paper, we extend our previous capacity solution to streaming applications and present a streaming multilingual E2E ASR system that runs fully on device with comparable quality and latency to individual monolingual models. To achieve that, we propose an Encoder Endpointer model and an End-of-Utterance (EOU) Joint Layer for a better quality and latency trade-off. Our system is built in a language agnostic manner allowing it to natively support intersentential code switching in real time. To address the feasibility concerns on large models, we conducted on-device profiling and replaced the time consuming LSTM decoder with the recently developed Embedding decoder. With these changes, we managed to run such a system on a mobile device in less than real time.
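The claim of running "in less than real time" is commonly expressed as a real-time factor (RTF): processing time divided by audio duration, where a value below 1.0 means the recognizer keeps up with incoming speech. The sketch below illustrates that arithmetic only. The function names and the example timings are hypothetical and are not taken from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (the RTF)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def runs_faster_than_real_time(processing_seconds: float, audio_seconds: float) -> bool:
    """True when the system finishes in less time than the audio lasts."""
    return real_time_factor(processing_seconds, audio_seconds) < 1.0


# Hypothetical timings for illustration, not measurements from the paper.
example_rtf = real_time_factor(processing_seconds=2.0, audio_seconds=4.0)
example_ok = runs_faster_than_real_time(processing_seconds=2.0, audio_seconds=4.0)
```

With these made-up timings, 2 seconds of compute for 4 seconds of audio gives an RTF of 0.5, so the check passes. A streaming on-device system needs this ratio to stay below 1.0 on the target hardware.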
