AS CL SD · Nov 3, 2020

Minimum Bayes Risk Training for End-to-End Speaker-Attributed ASR

Naoyuki Kanda, Zhong Meng, Liang Lu, Yashesh Gaur, Xiaofei Wang, Zhuo Chen, Takuya Yoshioka

arXiv:2011.02921v18.018 citations · Has Code

Originality: Incremental advance

AI Analysis

This work targets the accuracy of speaker-attributed speech recognition on overlapped speech. It is an incremental advance: it changes the training criterion so that it aligns more closely with the evaluation metric.

The paper addresses the training of end-to-end speaker-attributed ASR models for overlapped speech. It proposes a speaker-attributed minimum Bayes risk (SA-MBR) training method that directly minimizes the expected speaker-attributed word error rate (SA-WER), and it reports a 9.0% relative reduction in SA-WER compared with the previous SA-MMI-trained model.

Recently, an end-to-end speaker-attributed automatic speech recognition (E2E SA-ASR) model was proposed as a joint model of speaker counting, speech recognition and speaker identification for monaural overlapped speech. In the previous study, the model parameters were trained based on the speaker-attributed maximum mutual information (SA-MMI) criterion, with which the joint posterior probability for multi-talker transcription and speaker identification is maximized over training data. Although SA-MMI training showed promising results for overlapped speech consisting of various numbers of speakers, the training criterion was not directly linked to the final evaluation metric, i.e., speaker-attributed word error rate (SA-WER). In this paper, we propose a speaker-attributed minimum Bayes risk (SA-MBR) training method where the parameters are trained to directly minimize the expected SA-WER over the training data. Experiments using the LibriSpeech corpus show that the proposed SA-MBR training reduces the SA-WER by 9.0% relative compared with the SA-MMI-trained model.
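
To make the idea of "minimizing the expected SA-WER" concrete, here is a minimal sketch of a generic minimum Bayes risk objective computed over an N-best list. It is not the authors' implementation: the abstract does not say how hypotheses are generated or how the expectation is approximated, so the N-best approximation, the function names (`edit_distance`, `speaker_attributed_errors`, `expected_risk`) and the simplified error count (per-speaker word edit distance, summed over speakers) are all assumptions made for illustration.

```python
import math


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def speaker_attributed_errors(ref, hyp):
    """Assumed stand-in for an SA-WER numerator.

    ref and hyp map a speaker id to that speaker's word list. Words
    attributed to the wrong speaker count as errors because each
    speaker's transcript is scored separately.
    """
    speakers = set(ref) | set(hyp)
    return sum(edit_distance(ref.get(s, []), hyp.get(s, [])) for s in speakers)


def expected_risk(nbest, ref):
    """Posterior-weighted expected error over an N-best list.

    nbest is a list of (log_prob, hyp) pairs. The log-probabilities are
    renormalized over the list (an assumption of this sketch), and the
    result is the quantity an MBR-style criterion would minimize.
    """
    max_lp = max(lp for lp, _ in nbest)
    weights = [math.exp(lp - max_lp) for lp, _ in nbest]
    total = sum(weights)
    return sum((w / total) * speaker_attributed_errors(ref, hyp)
               for w, (_, hyp) in zip(weights, nbest))


# Toy example: the second hypothesis has the right words but swaps the speakers.
ref = {"spk1": ["hello", "world"], "spk2": ["good", "morning"]}
correct = {"spk1": ["hello", "world"], "spk2": ["good", "morning"]}
swapped = {"spk1": ["good", "morning"], "spk2": ["hello", "world"]}
nbest = [(math.log(0.5), correct), (math.log(0.5), swapped)]
risk = expected_risk(nbest, ref)
```

In this toy case the swapped hypothesis is penalized even though every word is recognized correctly. A criterion that only raises the posterior of the reference transcription does not see such speaker-dependent error counts directly, whereas an expected-error objective does. In actual training the hypothesis scores would come from the model and the risk would be differentiated with respect to the model parameters, which this sketch does not do.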
