Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications
This work addresses speaker assignment accuracy in real meeting transcription, offering incremental improvements for applications such as meeting analysis.
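The study addresses speaker assignment in speaker-attributed ASR (SA-ASR) for real meetings by proposing a pipeline that combines Voice Activity Detection (VAD), Speaker Diarization (SD), and SA-ASR. Fine-tuning the SA-ASR model on VAD output segments yields a relative reduction in Speaker Error Rate (SER) of up to 28%, and extracting speaker embedding templates from SD output rather than from annotated speaker segments yields a relative SER reduction of up to 20%.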
Past studies on end-to-end meeting transcription have focused on model architecture and have mostly been evaluated on simulated meeting data. We present a novel study aiming to optimize the use of a Speaker-Attributed ASR (SA-ASR) system in real-life scenarios, as represented by the AMI meeting corpus, for improved speaker assignment of speech segments. First, we propose a pipeline tailored to real-life applications involving Voice Activity Detection (VAD), Speaker Diarization (SD), and SA-ASR. Second, we advocate using VAD output segments to fine-tune the SA-ASR model, since the model is also applied to VAD segments at test time, and show that this results in a relative reduction of Speaker Error Rate (SER) of up to 28%. Finally, we explore strategies to enhance the extraction of the speaker embedding templates used as inputs by the SA-ASR system. We show that extracting them from SD output rather than from annotated speaker segments results in a relative SER reduction of up to 20%.
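The data flow of the VAD, SD, and SA-ASR pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `run_pipeline` and `relative_reduction`, the `AttributedSegment` type, and the signatures of the four callables (`vad`, `diarize`, `extract_template`, `sa_asr`) are all hypothetical placeholders for whatever models are actually used. The sketch only shows the ordering of the stages, the extraction of speaker templates from SD output, and how a relative SER reduction is computed.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Segment = Tuple[float, float]      # (start_sec, end_sec)
Turn = Tuple[float, float, str]    # (start_sec, end_sec, speaker_label)


@dataclass
class AttributedSegment:
    start: float
    end: float
    speaker: str
    text: str


def run_pipeline(
    audio: Any,
    vad: Callable[[Any], List[Segment]],
    diarize: Callable[[Any, List[Segment]], List[Turn]],
    extract_template: Callable[[Any, List[Segment]], Any],
    sa_asr: Callable[[Any, Segment, Dict[str, Any]], List[Tuple[str, str]]],
) -> List[AttributedSegment]:
    """Hypothetical glue code: every callable is an assumed interface."""
    # 1. VAD splits the recording into speech segments.
    segments = vad(audio)

    # 2. SD labels speech regions with speaker identities.
    turns = diarize(audio, segments)

    # 3. Speaker templates are built from SD output, not from annotations.
    regions_by_speaker: Dict[str, List[Segment]] = {}
    for start, end, speaker in turns:
        regions_by_speaker.setdefault(speaker, []).append((start, end))
    templates = {
        speaker: extract_template(audio, regions)
        for speaker, regions in regions_by_speaker.items()
    }

    # 4. SA-ASR transcribes each VAD segment and assigns speakers,
    #    using the templates as additional input.
    results: List[AttributedSegment] = []
    for start, end in segments:
        for speaker, text in sa_asr(audio, (start, end), templates):
            results.append(AttributedSegment(start, end, speaker, text))
    return results


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction in percent: 100 * (baseline - improved) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return 100.0 * (baseline - improved) / baseline
```

In this sketch, SA-ASR is applied to the same VAD segments at inference time that the abstract proposes using for fine-tuning, which is the train/test matching argument made above. A relative reduction is measured against the baseline error rate, so it is not the same as a drop in absolute percentage points.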