Echotune: A Modular Extractor Leveraging the Variable-Length Nature of Speech in ASR Tasks
This work addresses a bottleneck in ASR, the handling of variable-duration speech, and offers an incremental improvement over existing Transformer-based methods.
The authors address fixed-length attention windows in Transformer-based ASR, which can over-smooth the data and neglect long-term connectivity in variable-length speech. They introduce Echo-MSA, a variable-length attention module that improves word error rate (WER) while maintaining model stability.
The Transformer architecture has proven highly effective for Automatic Speech Recognition (ASR) tasks and has become a foundational component of much research in the domain. Historically, many approaches have relied on fixed-length attention windows, which become problematic for speech samples that vary in duration and complexity, leading to data over-smoothing and neglect of essential long-term connectivity. To address this limitation, we introduce Echo-MSA, a nimble module equipped with a variable-length attention mechanism that accommodates a range of speech sample complexities and durations. The module can extract speech features at several granularities, from frames and phonemes to words and discourse. The proposed design captures the variable-length nature of speech and addresses the limitations of fixed-length attention. Our evaluation uses a parallel attention architecture complemented by a dynamic gating mechanism that combines the output of traditional attention with the output of the Echo-MSA module. Empirical evidence from our study shows that integrating Echo-MSA into the primary model's training regime significantly improves word error rate (WER) while preserving the intrinsic stability of the original model.
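The parallel-attention-plus-gating idea described above can be illustrated with a minimal sketch. This is not the authors' Echo-MSA implementation, whose details are not given here. It assumes a single attention head, no learned query/key/value projections, one fixed local window standing in for the variable-length branch, and a per-frame sigmoid gate. The names `windowed_self_attention`, `gated_fusion`, `parallel_block`, `gate_w`, and `gate_b` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def windowed_self_attention(x, window):
    """Self-attention in which frame t only sees frames within `window` steps.

    x is a list of T feature vectors (lists of floats). Assumption for
    brevity: queries, keys, and values are all x itself (no projections).
    """
    T, d = len(x), len(x[0])
    out = []
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        span = range(lo, hi)
        # Scaled dot-product scores against every frame inside the window.
        scores = [sum(q * k for q, k in zip(x[t], x[s])) / math.sqrt(d)
                  for s in span]
        weights = softmax(scores)
        out.append([sum(w * x[s][j] for w, s in zip(weights, span))
                    for j in range(d)])
    return out


def gated_fusion(a, b, gate_w, gate_b):
    """Per-frame sigmoid gate g in (0, 1): out = g * a + (1 - g) * b.

    gate_w has length 2 * d and scores the concatenation of a_t and b_t.
    """
    fused = []
    for a_t, b_t in zip(a, b):
        z = sum(w * v for w, v in zip(gate_w, a_t + b_t)) + gate_b
        g = 1.0 / (1.0 + math.exp(-z))
        fused.append([g * ai + (1.0 - g) * bi for ai, bi in zip(a_t, b_t)])
    return fused


def parallel_block(x, window, gate_w, gate_b):
    """Run full-sequence and limited-span attention in parallel, then gate."""
    full = windowed_self_attention(x, window=len(x))   # standard attention
    local = windowed_self_attention(x, window=window)  # limited-span branch
    return gated_fusion(full, local, gate_w, gate_b)
```

For example, `parallel_block(x, window=1, gate_w=[0.0] * (2 * d), gate_b=0.0)` averages the two branches equally, because a zero gate input gives g = 0.5. In a trained model the gate parameters would be learned so that each frame can weight the branches differently.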