SD, LG, AS | Jun 13, 2024

Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition

arXiv:2406.08914v1 | 2 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of training speech separation models for ASR without reference transcriptions, which are often unavailable in real-world scenarios. It represents a domain-specific incremental advance.

The paper tackles the problem of automatic speech recognition for overlapping speakers in noisy and reverberant conditions by proposing a transcription-free method for joint training of speech separation and ASR models using only audio signals, achieving a 6.4% improvement in word error rate over a baseline signal-level loss.

One solution to automatic speech recognition (ASR) of overlapping speakers is to separate speech and then perform ASR on the separated signals. Commonly, the separator produces artefacts which often degrade ASR performance. Addressing this issue typically requires reference transcriptions to jointly train the separation and ASR networks. This is often not viable for training on real-world in-domain audio where reference transcript information is not always available. This paper proposes a transcription-free method for joint training using only audio signals. The proposed method uses embedding differences of pre-trained ASR encoders as a loss with a proposed modification to permutation invariant training (PIT) called guided PIT (GPIT). The method achieves a 6.4% improvement in word error rate (WER) measures over a signal-level loss and also shows enhancement improvements in perceptual measures such as short-time objective intelligibility (STOI).
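
The loss described in the abstract can be illustrated with a small sketch. It rests on assumptions the abstract does not confirm: "guided" is read here as choosing the speaker permutation with a signal-level loss (plain MSE in this sketch) and then scoring the embedding difference under that fixed permutation. The encoder is a toy random projection standing in for a pre-trained ASR encoder, and every function name is hypothetical.

```python
import itertools
import numpy as np


def mse(a, b):
    """Mean squared error between two arrays of equal shape."""
    return float(np.mean((a - b) ** 2))


def best_permutation(est, ref, loss_fn):
    """Standard PIT: try every speaker ordering and keep the cheapest one.

    Returns (perm, loss), where est[i] is matched to ref[perm[i]].
    """
    best_perm, best_loss = None, float("inf")
    for perm in itertools.permutations(range(len(ref))):
        loss = float(np.mean([loss_fn(est[i], ref[p]) for i, p in enumerate(perm)]))
        if loss < best_loss:
            best_perm, best_loss = perm, loss
    return best_perm, best_loss


def make_toy_encoder(n_samples, dim=8, seed=0):
    """Hypothetical stand-in for a frozen, pre-trained ASR encoder."""
    rng = np.random.default_rng(seed)
    weights = rng.standard_normal((n_samples, dim))
    return lambda signal: np.tanh(signal @ weights)


def guided_pit_embedding_loss(est, ref, encoder):
    """Assumed reading of guided PIT, not the paper's confirmed definition.

    1. Pick the permutation with a signal-level loss (MSE here).
    2. Compute the encoder-embedding loss under that permutation only.
    """
    perm, _ = best_permutation(est, ref, mse)
    emb_loss = np.mean(
        [mse(encoder(est[i]), encoder(ref[p])) for i, p in enumerate(perm)]
    )
    return perm, float(emb_loss)


# Two reference speakers, with the estimates arriving in swapped order.
rng = np.random.default_rng(1)
ref = rng.standard_normal((2, 16))
est = ref[[1, 0]] + 0.01 * rng.standard_normal((2, 16))
encoder = make_toy_encoder(n_samples=16)
perm, loss = guided_pit_embedding_loss(est, ref, encoder)
```

Under this reading, only the clean audio references are needed to compute the loss. No transcript enters at any point, which is what makes the training transcription-free.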
