SD · CL · AS · Jun 18, 2024

Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting

arXiv:2406.12611v1 · 5 citations
Originality: Incremental advance
AI Analysis

This work addresses language-specific adaptation in CTC-based speech recognition models, with particular benefit for low-resource languages. It is incremental in that it builds on the existing self-conditioned CTC framework.

The paper adapts Connectionist Temporal Classification (CTC) models for multilingual speech recognition by introducing an encoder prompting technique, which reduces errors by 28% on average and by 41% on low-resource languages.

End-to-end multilingual speech recognition models handle multiple languages through a single model, often incorporating language identification to automatically detect the language of incoming speech. In the common scenario where the language is already known, these models can behave as language-specific models by taking the language information as a prompt, which is particularly beneficial for attention-based encoder-decoder architectures. However, the Connectionist Temporal Classification (CTC) approach, which enhances recognition via joint decoding and multi-task training, does not normally incorporate language prompts because its output tokens are conditionally independent. To overcome this, we introduce an encoder prompting technique within the self-conditioned CTC framework, enabling language-specific adaptation of the CTC model in a zero-shot manner. Our method has been shown to significantly reduce errors, by 28% on average and by 41% on low-resource languages.
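The abstract does not spell out how the prompt enters the encoder, so the following is only a rough sketch of one possible reading. It assumes the standard self-conditioning step, in which an intermediate layer's CTC posterior is projected back and added to the hidden states, and it assumes that prompting means overriding that posterior with a one-hot vector on a language token. The function name `self_condition`, the parameter `prompt_weight`, the weight matrices, and all dimensions are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_condition(hidden, w_out, w_back, lang_token=None, prompt_weight=1.0):
    """One self-conditioning step at an intermediate encoder layer.

    hidden: (T, D) hidden states; w_out: (D, V) projection to the CTC
    vocabulary; w_back: (V, D) projection back to the hidden size.
    If lang_token is given, the intermediate posterior is mixed with a
    one-hot distribution on that token (ASSUMED prompting mechanism).
    """
    posterior = softmax(hidden @ w_out)  # (T, V) intermediate CTC posterior
    if lang_token is not None:
        one_hot = np.zeros_like(posterior)
        one_hot[:, lang_token] = 1.0
        # prompt_weight = 1.0 replaces the prediction with the known language.
        posterior = (1.0 - prompt_weight) * posterior + prompt_weight * one_hot
    # Feed the (possibly prompted) posterior back into the hidden states.
    return hidden + posterior @ w_back, posterior


# Toy dimensions and random weights, purely for illustration.
rng = np.random.default_rng(0)
T, D, V = 5, 8, 12
hidden = rng.normal(size=(T, D))
w_out = rng.normal(size=(D, V))
w_back = rng.normal(size=(V, D))

plain, post_plain = self_condition(hidden, w_out, w_back)
prompted, post_prompted = self_condition(hidden, w_out, w_back, lang_token=3)
```

In this sketch no weights change when a language is supplied: only the conditioning signal does, which is one way a known language could steer an already trained CTC encoder at inference time without further training.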

