AS LG SD | Apr 13, 2022

A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes

Shaojin Ding, Weiran Wang, Ding Zhao, Tara N. Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang, Rina Panigrahy, Qiao Liang, Dongseong Hwang, Ian McGraw

arXiv:2204.06164v3 | 9.717 citations | h-index: 69

Originality: Incremental advance

AI Analysis

This work addresses the need for efficient ASR deployment in resource-constrained environments, offering incremental improvements in model compression and unification.

The paper addresses the deployment of Automatic Speech Recognition (ASR) models across different scenarios by proposing a dynamic cascaded encoder model that unifies models of several sizes. Compared to the baseline cascaded encoder model, the large-medium model is 30% smaller and consumes 33% less power without quality loss. A triple-size model that unifies the large, medium, and small models reduces total size by 37% with minimal quality loss.

In this paper, we propose a dynamic cascaded encoder Automatic Speech Recognition (ASR) model, which unifies models for different deployment scenarios. Moreover, the model can significantly reduce model size and power consumption without loss of quality. Namely, with the dynamic cascaded encoder model, we explore three techniques to maximally boost the performance of each model size: 1) Use separate decoders for each sub-model while sharing the encoders; 2) Use funnel-pooling to improve the encoder efficiency; 3) Balance the size of causal and non-causal encoders to improve quality and fit deployment constraints. Overall, the proposed large-medium model has 30% smaller size and reduces power consumption by 33%, compared to the baseline cascaded encoder model. The triple-size model that unifies the large, medium, and small models achieves 37% total size reduction with minimal quality loss, while substantially reducing the engineering efforts of having separate models.
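The structure described in the abstract (shared cascaded encoders, one decoder per sub-model, and funnel-pooling along time) can be illustrated with a toy sketch. This is not the authors' implementation: the class `DynamicCascadedEncoder`, the helper `make_layer`, the scaling "encoders", the placement of pooling before each later encoder, and the stride of 2 are all assumptions chosen only to make the data flow visible. The causal versus non-causal distinction is not modeled.

```python
from typing import Callable, List, Tuple

Frame = List[float]
Frames = List[Frame]


def funnel_pool(frames: Frames, stride: int = 2) -> Frames:
    """Mean-pool consecutive frames along time, shortening the sequence."""
    pooled = []
    for start in range(0, len(frames), stride):
        window = frames[start:start + stride]
        dim = len(window[0])
        pooled.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return pooled


def make_layer(scale: float) -> Callable[[Frames], Frames]:
    """Hypothetical stand-in for an encoder stack: scales every feature."""
    def layer(frames: Frames) -> Frames:
        return [[scale * x for x in frame] for frame in frames]
    return layer


class DynamicCascadedEncoder:
    """Toy model: sub-models share a prefix of encoders but own their decoder."""

    def __init__(self) -> None:
        # Shared encoders, applied in cascade (assumed: three stages).
        self.encoders = [make_layer(1.0), make_layer(2.0), make_layer(3.0)]
        # Each size uses the first `depth` encoders.
        self.depth = {"small": 1, "medium": 2, "large": 3}

    def decode(self, size: str, feats: Frames) -> Tuple[str, Frames]:
        # Placeholder for a size-specific decoder: tags features with its name.
        return (size, feats)

    def run(self, size: str, frames: Frames) -> Tuple[str, Frames]:
        feats = frames
        for i, encoder in enumerate(self.encoders[: self.depth[size]]):
            if i > 0:
                # Assumed placement: pool before each later encoder.
                feats = funnel_pool(feats)
            feats = encoder(feats)
        return self.decode(size, feats)


model = DynamicCascadedEncoder()
audio = [[1.0], [3.0], [5.0], [7.0]]
for size in ("small", "medium", "large"):
    name, feats = model.run(size, audio)
    print(name, len(feats), feats)
```

In this sketch every size reuses the same first encoder, so the weights of the smaller sub-models are contained in the larger ones, while pooling shortens the sequence seen by later stages.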
