AS | LG | SD | Oct 17, 2023

Zipformer: A faster and better encoder for automatic speech recognition

arXiv:2310.11230v4 | 156 citations | h-index: 12 | Has Code
Originality: Highly original
AI Analysis

This work addresses the need for more efficient and accurate ASR models. The contribution is incremental: it builds upon the Conformer with specific architectural and optimization improvements.

The authors improve automatic speech recognition (ASR) with Zipformer, a faster and more memory-efficient encoder that outperforms the Conformer and other state-of-the-art ASR models on LibriSpeech, Aishell-1, and WenetSpeech.

The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more memory-efficient, and better-performing transformer, called Zipformer. Modeling changes include: 1) a U-Net-like encoder structure where middle stacks operate at lower frame rates; 2) reorganized block structure with more modules, within which we re-use attention weights for efficiency; 3) a modified form of LayerNorm called BiasNorm allows us to retain some length information; 4) new activation functions SwooshR and SwooshL work better than Swish. We also propose a new optimizer, called ScaledAdam, which scales the update by each tensor's current scale to keep the relative change about the same, and also explicitly learns the parameter scale. It achieves faster convergence and better performance than Adam. Extensive experiments on LibriSpeech, Aishell-1, and WenetSpeech datasets demonstrate the effectiveness of our proposed Zipformer over other state-of-the-art ASR models. Our code is publicly available at https://github.com/k2-fsa/icefall.
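
The abstract names BiasNorm, SwooshR, SwooshL, and ScaledAdam without giving their formulas. The NumPy sketch below is one way to read them. The formulas and constants are recalled from the paper, not from this page, and should be checked against the icefall repository. The function names, the `eps` guard in `bias_norm`, and `min_rms` are illustrative assumptions. `scaled_update` shows only the "scale the update by the tensor's current scale" idea and leaves out the learned parameter scale.

```python
import numpy as np


def swoosh_r(x):
    # Assumed form: log(1 + exp(x - 1)) - 0.08 * x - 0.313261687.
    # logaddexp(0, z) computes log(1 + exp(z)) without overflow.
    return np.logaddexp(0.0, x - 1.0) - 0.08 * x - 0.313261687


def swoosh_l(x):
    # Assumed form: log(1 + exp(x - 4)) - 0.08 * x - 0.035.
    return np.logaddexp(0.0, x - 4.0) - 0.08 * x - 0.035


def bias_norm(x, bias, log_scale, eps=1e-8):
    # Assumed form: x / RMS[x - bias] * exp(log_scale), with the RMS taken
    # over the last (channel) axis. eps is a guard added for this sketch.
    rms = np.sqrt(np.mean((x - bias) ** 2, axis=-1, keepdims=True))
    return x / (rms + eps) * np.exp(log_scale)


def scaled_update(param, direction, lr, min_rms=1e-5):
    # Simplified illustration: multiply an Adam-style step direction by the
    # tensor's current RMS, so the relative change is about the same for
    # small and large tensors. The learned scale is omitted here.
    param_rms = max(min_rms, float(np.sqrt(np.mean(param ** 2))))
    return param - lr * param_rms * direction


if __name__ == "__main__":
    x = np.array([[3.0, 4.0]])
    print(swoosh_r(x), swoosh_l(x))
    print(bias_norm(x, bias=np.zeros(2), log_scale=0.0))
```

With `bias` at zero and `log_scale` at zero, `bias_norm` rescales each frame to unit RMS. A non-zero `bias` changes the denominator only, which is how some length information can survive normalization.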

Code Implementations: 1 repo