CL · SD · AS · Feb 27, 2024

Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models

arXiv:2402.17184v1 · 11 citations · h-index: 36 · ICASSP
Originality: Incremental advance
AI Analysis

This paper addresses the deployment bottleneck of large ASR models in real-time applications by reducing computational latency.

The paper tackles the computational inefficiency of large end-to-end ASR models by applying multiple frame reduction layers in the encoder to compress its output frames. It reaches one output frame per 2.56 seconds of input speech without significantly affecting word error rate, while improving encoder and decoder latencies by 48% and 92% respectively, relative to a strong but computationally expensive baseline.

The accuracy of end-to-end (E2E) automatic speech recognition (ASR) models continues to improve as they are scaled to larger sizes, with some now reaching billions of parameters. Widespread deployment and adoption of these models, however, requires computationally efficient strategies for decoding. In the present work, we study one such strategy: applying multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames. While similar techniques have been investigated in previous work, we achieve dramatically more reduction than has previously been demonstrated through the use of multiple funnel reduction layers. Through ablations, we study the impact of various architectural choices in the encoder to identify the most effective strategies. We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task, while improving encoder and decoder latencies by 48% and 92% respectively, relative to a strong but computationally expensive baseline.
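To make the 2.56 sec figure concrete, the arithmetic behind stacked frame reduction can be sketched as follows. This is an illustrative sketch only: the 40 ms input frame period, the six stride-2 layers, and the function names are assumptions chosen so that the product comes out to 2.56 sec, not the configuration reported in the paper. Plain average pooling is used here as a stand-in for the funnel reduction layers.

```python
import math


def output_frame_period_ms(input_period_ms: float, strides: list[int]) -> float:
    """Frame period after a stack of reduction layers with the given strides."""
    return input_period_ms * math.prod(strides)


def num_frames(duration_s: float, period_ms: float) -> int:
    """Number of frames needed to cover an utterance at a given frame period."""
    return math.ceil(duration_s * 1000 / period_ms)


def avg_pool_frames(frames: list[list[float]], stride: int = 2) -> list[list[float]]:
    """Average consecutive frames along time (stand-in for one reduction layer).

    A trailing partial window is averaged over the frames it contains.
    """
    pooled = []
    for start in range(0, len(frames), stride):
        chunk = frames[start:start + stride]
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled


# Assumed setup (hypothetical): 40 ms encoder frames, six stride-2 reductions.
ASSUMED_INPUT_PERIOD_MS = 40.0
ASSUMED_STRIDES = [2] * 6

period_ms = output_frame_period_ms(ASSUMED_INPUT_PERIOD_MS, ASSUMED_STRIDES)
frames_before = num_frames(10.0, ASSUMED_INPUT_PERIOD_MS)
frames_after = num_frames(10.0, period_ms)

# Push a dummy 10-second utterance (1-dimensional features) through the stack.
features = [[float(i)] for i in range(frames_before)]
for stride in ASSUMED_STRIDES:
    features = avg_pool_frames(features, stride)

print(period_ms / 1000, frames_before, frames_after, len(features))
```

Under these assumptions, a 10-second utterance shrinks from 250 encoder frames to 4, which is why the decoder, whose cost grows with the number of encoder output frames, would be expected to benefit most from the reduction.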

