CL Mar 11, 2024

Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications

Can Cui, Imran Ahamad Sheikh, Mostafa Sadeghi, Emmanuel Vincent

arXiv:2403.06570v23.45 citations h-index: 2 Odyssey

Originality: Incremental advance

AI Analysis

This paper addresses speaker assignment accuracy in the transcription of real meetings, offering incremental improvements for applications such as meeting analysis.

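The study addresses speaker assignment in speaker-attributed ASR (SA-ASR) for real meetings by proposing a pipeline that combines Voice Activity Detection (VAD), Speaker Diarization (SD), and SA-ASR. Fine-tuning the SA-ASR model on VAD output segments yields a relative reduction in Speaker Error Rate (SER) of up to 28%, and extracting speaker embedding templates from SD output rather than from annotated speaker segments yields a relative SER reduction of up to 20%.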
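The reductions quoted above are relative, not absolute, so a short sketch of the arithmetic may help. The function name `relative_reduction` and the numbers in the example are illustrative assumptions, not values or code from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction of `improved` with respect to `baseline`, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical error rates in percent, chosen only to illustrate the formula.
# They are NOT results reported in the paper.
example = relative_reduction(baseline=25.0, improved=18.0)
print(f"relative SER reduction: {example:.0%}")
```

With these made-up numbers, an absolute drop of 7 points from a 25% baseline corresponds to a 28% relative reduction.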

Past studies on end-to-end meeting transcription have focused on model architecture and have mostly been evaluated on simulated meeting data. We present a novel study aiming to optimize the use of a Speaker-Attributed ASR (SA-ASR) system in real-life scenarios, such as the AMI meeting corpus, for improved speaker assignment of speech segments. First, we propose a pipeline tailored to real-life applications involving Voice Activity Detection (VAD), Speaker Diarization (SD), and SA-ASR. Second, we advocate using VAD output segments to fine-tune the SA-ASR model, considering that it is also applied to VAD segments during test, and show that this results in a relative reduction of Speaker Error Rate (SER) up to 28%. Finally, we explore strategies to enhance the extraction of the speaker embedding templates used as inputs by the SA-ASR system. We show that extracting them from SD output rather than annotated speaker segments results in a relative SER reduction up to 20%.
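As a rough illustration of the data flow described in the abstract, the sketch below chains VAD, SD, speaker-template extraction, and SA-ASR. Every name here (`Segment`, `transcribe_meeting`, and the four callable interfaces) is a hypothetical placeholder, and the assumption that the diarizer consumes the VAD segments is ours; the paper's actual components and interfaces are not specified in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Segment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds


# Hypothetical interfaces; real components would wrap trained models.
Audio = Sequence[float]
Vad = Callable[[Audio], List[Segment]]
Diarizer = Callable[[Audio, List[Segment]], Dict[str, List[Segment]]]
Embedder = Callable[[Audio, List[Segment]], List[float]]
SaAsr = Callable[[Audio, Segment, Dict[str, List[float]]], List[Tuple[str, str]]]


def transcribe_meeting(
    audio: Audio, vad: Vad, diarizer: Diarizer, embedder: Embedder, sa_asr: SaAsr
) -> List[Tuple[Segment, List[Tuple[str, str]]]]:
    """Run VAD -> SD -> template extraction -> SA-ASR on one recording."""
    # 1. VAD splits the recording into speech segments.
    segments = vad(audio)
    # 2. SD groups speech into per-speaker turns (assumed here to use the VAD segments).
    speaker_turns = diarizer(audio, segments)
    # 3. Speaker embedding templates are built from the SD output,
    #    not from annotated speaker segments.
    templates = {spk: embedder(audio, turns) for spk, turns in speaker_turns.items()}
    # 4. SA-ASR decodes each VAD segment into (speaker, text) pairs.
    return [(seg, sa_asr(audio, seg, templates)) for seg in segments]
```

Passing the components in as callables keeps the sketch independent of any particular toolkit; each stage can be replaced by a stub for testing or by a wrapper around a real model.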
