SDCLASSP Jun 20, 2023

Multi-pass Training and Cross-information Fusion for Low-resource End-to-end Accented Speech Recognition

arXiv:2306.11309v1, 6 citations, h-index: 11
Originality: Incremental advance
AI Analysis

This paper addresses accented speech recognition for practical ASR applications, particularly in low-resource settings, and offers incremental improvements over existing methods.

The paper tackles low-resource accented speech recognition by proposing a Conformer-based architecture with multi-pass training and cross-information fusion, achieving relative word/character error rate reductions of 10.2% to 24.5% on accented English and Mandarin test sets compared to a Conformer baseline.
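The "relative error rate reduction" quoted above is the drop in word or character error rate expressed as a fraction of the baseline's error rate. A minimal Python sketch of that arithmetic follows; the function name and the example numbers are illustrative assumptions and are not taken from the paper.

```python
def relative_error_reduction(baseline_error: float, system_error: float) -> float:
    """Return the relative reduction, in percent, of an error rate versus a baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - system_error) / baseline_error


# Hypothetical error rates (not from the paper): baseline 10.0% WER, new system 8.5% WER.
reduction = relative_error_reduction(10.0, 8.5)
print(f"{reduction:.1f}% relative reduction")
```

Because the reduction is relative, the same percentage can correspond to very different absolute gains depending on how high the baseline error rate is.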

Low-resource accented speech recognition is one of the important challenges faced by current ASR technology in practical applications. In this study, we propose a Conformer-based architecture, called Aformer, to leverage acoustic information from both large non-accented and limited accented training data. Specifically, a general encoder and an accent encoder are designed in the Aformer to extract complementary acoustic information. Moreover, we propose to train the Aformer in a multi-pass manner, and investigate three cross-information fusion methods to effectively combine the information from the general and accent encoders. All experiments are conducted on both accented English and Mandarin ASR tasks. Results show that our proposed methods outperform the strong Conformer baseline by a relative word/character error rate reduction of 10.2% to 24.5% on six in-domain and out-of-domain accented test sets.
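The abstract does not name the three cross-information fusion methods, so the sketch below only illustrates the general idea of combining frame-level outputs from two encoders. It shows two generic options, concatenation and a weighted sum, chosen here as assumptions for illustration. The function names, the `weight` parameter, and the toy feature values are hypothetical and should not be read as the Aformer's actual design.

```python
from typing import List

Frame = List[float]  # one feature vector per time step


def _check_frames(general: List[Frame], accent: List[Frame]) -> None:
    # Both encoders are assumed to emit the same number of frames.
    if len(general) != len(accent):
        raise ValueError("encoder outputs must have the same number of frames")


def fuse_concat(general: List[Frame], accent: List[Frame]) -> List[Frame]:
    """Concatenate the two encoders' features frame by frame."""
    _check_frames(general, accent)
    return [g + a for g, a in zip(general, accent)]


def fuse_weighted_sum(
    general: List[Frame], accent: List[Frame], weight: float = 0.5
) -> List[Frame]:
    """Interpolate element-wise: weight * general + (1 - weight) * accent."""
    _check_frames(general, accent)
    fused = []
    for g, a in zip(general, accent):
        if len(g) != len(a):
            raise ValueError("feature dimensions must match for a weighted sum")
        fused.append([weight * x + (1.0 - weight) * y for x, y in zip(g, a)])
    return fused


# Toy outputs for two frames with two features each (made-up values).
general_out = [[1.0, 2.0], [0.0, 4.0]]
accent_out = [[3.0, 4.0], [2.0, 0.0]]
print(fuse_concat(general_out, accent_out))
print(fuse_weighted_sum(general_out, accent_out, weight=0.5))
```

Concatenation keeps both feature sets intact but doubles the feature dimension, while a weighted sum keeps the dimension fixed and requires the two encoders to share it.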

