AS LG SD Sep 14, 2023

USM-SCD: Multilingual Speaker Change Detection Based on Large Pretrained Foundation Models

Guanlong Zhao, Yongqiang Wang, Jason Pelecanos, Yu Zhang, Hank Liao, Yiling Huang, Han Lu, Quan Wang

Google

arXiv:2309.08023v35.16 citations h-index: 36

Originality: Incremental advance

AI Analysis

The paper provides a single multilingual model for speaker change detection and transcription, though the advance is incremental because it adapts an existing foundation model.

The paper tackles speaker change detection and ASR for 96 languages by fine-tuning a large pretrained speech foundation model. It reports an average speaker change detection F1 score above 75% across languages and an 85.8% F1 score on American English, a 21% relative improvement over the previous monolingual baseline.

We introduce a multilingual speaker change detection model (USM-SCD) that can simultaneously detect speaker turns and perform ASR for 96 languages. This model is adapted from a speech foundation model trained on a large quantity of supervised and unsupervised data, demonstrating the utility of fine-tuning from a large generic foundation model for a downstream task. We analyze the performance of this multilingual speaker change detection model through a series of ablation studies. We show that the USM-SCD model can achieve more than 75% average speaker change detection F1 score across a test set that consists of data from 96 languages. On American English, the USM-SCD model can achieve an 85.8% speaker change detection F1 score across various public and internal test sets, beating the previous monolingual baseline model by 21% relative. We also show that we only need to fine-tune one-quarter of the trainable model parameters to achieve the best model performance. The USM-SCD model exhibits state-of-the-art ASR quality compared with a strong public ASR baseline, making it suitable to handle both tasks with negligible additional computational cost.
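To make the headline numbers concrete, the sketch below shows how an F1 score and a relative improvement are typically computed. It assumes the 21% figure is measured relative to the baseline's F1 score; the helper names are hypothetical, and the implied baseline is back-of-the-envelope arithmetic rather than a number reported in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_improvement(new: float, baseline: float) -> float:
    """Gain of `new` over `baseline`, expressed as a fraction of the baseline."""
    return (new - baseline) / baseline


def implied_baseline(new: float, rel_gain: float) -> float:
    """Baseline score implied by a new score and a relative gain."""
    return new / (1.0 + rel_gain)


# Figures taken from the abstract: 85.8% F1 and a 21% relative gain.
# Assumption: the gain is relative to the baseline F1, so the value
# below is an estimate, not a result reported by the authors.
baseline_f1 = implied_baseline(0.858, 0.21)
print(f"implied baseline F1: {baseline_f1:.3f}")
```

Under that assumption, the previous monolingual baseline would sit at roughly 71% F1 on the American English test sets.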
