CL | Oct 29, 2024

Joint Beamforming and Speaker-Attributed ASR for Real Distant-Microphone Meeting Transcription

arXiv:2410.21849v2 | 1 citation | h-index: 2 | EUSIPCO
Originality: Incremental advance
AI Analysis

The paper addresses real meeting transcription for applications such as conferencing. The advance is incremental: it builds on existing SA-ASR models by adding a neural beamforming front-end.

The paper tackles distant-microphone meeting transcription with a joint beamforming and speaker-attributed ASR approach. On the AMI corpus, fine-tuning SA-ASR on a fixed beamformer's output and jointly fine-tuning SA-ASR with a neural beamformer reduce the word error rate by 8% and 9% relative, respectively.
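To make the "relative" qualifier concrete, the following minimal sketch computes a relative word error rate reduction. The function name and the WER values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only; NOT taken from the paper.
baseline = 25.0  # assumed baseline WER, in percent
improved = 23.0  # assumed WER after adding a front-end, in percent

reduction = relative_wer_reduction(baseline, improved)
print(f"{reduction:.1%} relative, {baseline - improved:.1f} points absolute")
# prints "8.0% relative, 2.0 points absolute"
```

With these assumed values, an 8% relative reduction corresponds to only 2 absolute WER points, which is why the two ways of reporting should not be confused.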

Distant-microphone meeting transcription is a challenging task. State-of-the-art end-to-end speaker-attributed automatic speech recognition (SA-ASR) architectures lack a multichannel noise and reverberation reduction front-end, which limits their performance. In this paper, we introduce a joint beamforming and SA-ASR approach for real meeting transcription. We first describe a data alignment and augmentation method to pretrain a neural beamformer on real meeting data. We then compare fixed, hybrid, and fully neural beamformers as front-ends to the SA-ASR model. Finally, we jointly optimize the fully neural beamformer and the SA-ASR model. Experiments on the real AMI corpus show that, while state-of-the-art multi-frame cross-channel attention based channel fusion fails to improve ASR performance, fine-tuning SA-ASR on the fixed beamformer's output and jointly fine-tuning SA-ASR with the neural beamformer reduce the word error rate by 8% and 9% relative, respectively.
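As a rough illustration of what a fixed beamformer front-end does, here is a minimal delay-and-sum sketch. The abstract does not say which fixed beamformer the authors use, so delay-and-sum is an assumption made for this example, as are the function name, the integer sample delays, and the toy signals.

```python
from typing import List, Sequence


def delay_and_sum(
    channels: Sequence[Sequence[float]], delays: Sequence[int]
) -> List[float]:
    """Align each channel by its integer sample delay, then average.

    delays[i] is the number of samples by which the source arrives later
    at microphone i. The delays are assumed known and non-negative here;
    a real system has to estimate them from the signals.
    """
    if len(channels) != len(delays):
        raise ValueError("need exactly one delay per channel")
    if any(d < 0 for d in delays):
        raise ValueError("delays must be non-negative")
    # Only the span covered by every shifted channel can be averaged.
    n_out = min(len(ch) - d for ch, d in zip(channels, delays))
    if n_out <= 0:
        return []
    return [
        sum(ch[t + d] for ch, d in zip(channels, delays)) / len(channels)
        for t in range(n_out)
    ]


# Toy example: the same pulse reaches mic 1 one sample after mic 0.
mic0 = [0.0, 1.0, 0.0, 0.0]
mic1 = [0.0, 0.0, 1.0, 0.0]
enhanced = delay_and_sum([mic0, mic1], delays=[0, 1])
print(enhanced)  # prints [0.0, 1.0, 0.0]
```

After alignment the pulse adds coherently across channels, while uncorrelated noise would be attenuated by the averaging. The neural beamformers compared in the paper replace such fixed weights with learned ones.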
