AS CL LG Jun 12, 2024

Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions

Anfeng Xu, Kevin Huang, Tiantian Feng, Lue Shen, Helen Tager-Flusberg, Shrikanth Narayanan

arXiv:2406.07890v15.111 citations Has Code

Originality: Incremental advance

AI Analysis

This work addresses low-resource child speech understanding for researchers and practitioners in speech processing. It is an incremental advance because it applies existing foundation models to a specific domain.

The paper tackles speaker diarization in child-adult dyadic interactions using speech foundation models, reporting 39.5% and 62.3% relative reductions in Diarization Error Rate and Speaker Confusion Rate, respectively, compared to previous methods.

Speech foundation models, trained on vast datasets, have opened unique opportunities in addressing challenging low-resource speech understanding tasks, such as child speech. In this work, we explore the capabilities of speech foundation models on child-adult speaker diarization. We show that exemplary foundation models can achieve 39.5% and 62.3% relative reductions in Diarization Error Rate and Speaker Confusion Rate, respectively, compared to previous speaker diarization methods. In addition, we benchmark and evaluate the speaker diarization results of the speech foundation models while varying the input audio window size, speaker demographics, and training data ratio. Our results highlight promising pathways for understanding and adopting speech foundation models to facilitate child speech understanding.
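
For readers unfamiliar with the metrics, the following is a minimal sketch of how a Diarization Error Rate and a relative reduction are commonly computed. It assumes the standard definition of DER as the sum of missed speech, false alarm, and speaker confusion time divided by total reference speech time. The function names and all durations are hypothetical illustrations and are not taken from the paper or its code.

```python
def error_rate(error_seconds: float, total_speech_seconds: float) -> float:
    """Return an error duration as a fraction of total reference speech time."""
    if total_speech_seconds <= 0:
        raise ValueError("total_speech_seconds must be positive")
    return error_seconds / total_speech_seconds


def diarization_error_rate(missed: float, false_alarm: float,
                           confusion: float, total_speech: float) -> float:
    """DER = (missed + false alarm + speaker confusion) / total reference speech."""
    return error_rate(missed + false_alarm + confusion, total_speech)


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` improves on `baseline` (0.40 means a 40% reduction)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline


# Hypothetical durations in seconds, chosen only to make the arithmetic easy.
TOTAL_SPEECH = 500.0
baseline_der = diarization_error_rate(30.0, 20.0, 50.0, TOTAL_SPEECH)
new_der = diarization_error_rate(25.0, 15.0, 20.0, TOTAL_SPEECH)

# Speaker Confusion Rate uses only the confusion component.
baseline_scr = error_rate(50.0, TOTAL_SPEECH)
new_scr = error_rate(20.0, TOTAL_SPEECH)

der_reduction = relative_reduction(baseline_der, new_der)
scr_reduction = relative_reduction(baseline_scr, new_scr)

print(f"DER: {baseline_der:.3f} -> {new_der:.3f} ({der_reduction:.1%} relative reduction)")
print(f"SCR: {baseline_scr:.3f} -> {new_scr:.3f} ({scr_reduction:.1%} relative reduction)")
```

A relative reduction is measured against the baseline error rate rather than in absolute percentage points, so a 39.5% relative reduction in DER does not mean that DER fell by 39.5 points.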
