CL · May 27, 2023

Bridging the Granularity Gap for Acoustic Modeling

arXiv:2305.17356v1225 citations
Originality: Incremental advance
AI Analysis

This paper addresses the difficulty of capturing long-distance dependencies when Transformers model speech at the frame level, offering an incremental improvement over existing methods.

The paper tackles the challenge of modeling fine-grained, frame-level acoustic features in Transformers for speech. It proposes Progressive Down-Sampling (PDS), which compresses the features into coarser units. With the sequence compressed to 1/32 of its initial length, the method achieves better or comparable speech recognition performance and inference speedups of 1.20× to 1.47×.

While Transformer has become the de-facto standard for speech, modeling upon the fine-grained frame-level features remains an open challenge of capturing long-distance dependencies and distributing the attention weights. We propose Progressive Down-Sampling (PDS) which gradually compresses the acoustic features into coarser-grained units containing more complete semantic information, like text-level representation. In addition, we develop a representation fusion method to alleviate information loss that occurs inevitably during high compression. In this way, we compress the acoustic features into 1/32 of the initial length while achieving better or comparable performances on the speech recognition task. And as a bonus, it yields inference speedups ranging from 1.20× to 1.47×. By reducing the modeling burden, we also achieve competitive results when training on the more challenging speech translation task.
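To make the 1/32 compression concrete, here is a minimal sketch of how stage-wise down-sampling shortens a frame sequence. It is an illustration only, not the paper's implementation: the abstract does not give the number of stages, the per-stage ratios, or the compression operator, so the stride list `[2, 2, 2, 2, 2]`, the mean-pooling operator, and the function names `downsample` and `progressive_downsample` are all assumptions. The representation fusion step is not modeled.

```python
def downsample(frames, stride):
    """Mean-pool each run of `stride` consecutive frames into one vector.

    Mean pooling is an assumed stand-in; the source does not say which
    operator PDS uses to merge frames.
    """
    pooled = []
    for start in range(0, len(frames), stride):
        group = frames[start:start + stride]  # last group may be shorter
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


def progressive_downsample(frames, strides):
    """Apply the stages one after another and record the length after each."""
    lengths = [len(frames)]
    for stride in strides:
        frames = downsample(frames, stride)
        lengths.append(len(frames))
    return frames, lengths


# Toy input: 1024 frames of 2-dimensional features.
frames = [[float(t), 1.0] for t in range(1024)]

# Hypothetical stage strides whose product is 32 (assumption, not from the paper).
strides = [2, 2, 2, 2, 2]

compressed, lengths = progressive_downsample(frames, strides)
print(lengths)          # sequence length after each stage
print(len(compressed))  # final length: 1024 / 32
```

With these assumed strides the sequence shrinks from 1024 frames to 32 units. Since self-attention compares every position with every other, the layers operating on the shortest sequence score 32 × 32 pairs instead of 1024 × 1024. The end-to-end speedups reported in the abstract (1.20× to 1.47×) are far smaller than that ratio, which is to be expected when only part of the model runs at the coarsest resolution.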

Code Implementations: 1 repo