AS LG SD Sep 24, 2024

Evaluation of Speech Foundation Models for ASR on Child-Adult Conversations in Autism Diagnostic Sessions

Aditya Ashvin, Rimita Lahiri, Aditya Kommineni, Somer Bishop, Catherine Lord, Sudarsana Reddy Kadiri, Shrikanth Narayanan

arXiv:2409.16135v21.21 citations h-index: 22

Originality: Synthesis-oriented

AI Analysis

This paper addresses reliable transcription of child-adult conversations for autism diagnosis in clinical settings, but its contribution is incremental: it applies existing methods to a new domain.

The paper evaluates speech foundation models for automatic speech recognition (ASR) on child-adult conversations in autism diagnostic sessions. Word error rate (WER) is 15-20% absolute higher for child speech than for adult speech. Fine-tuning Whisper-large with LoRA improves WER by 8% absolute for child speech and 13% absolute for adult speech.
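The WER figures in this summary are absolute differences, that is, differences in percentage points between two WER values. As a rough illustration of the metric, here is a minimal sketch that computes WER as word-level edit distance divided by reference length. The function name, the lowercasing step, and the example utterances are assumptions made for illustration only; they are not taken from the paper or its dataset, and the paper's own text normalization and scoring tool are not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    # Assumption: simple lowercasing and whitespace tokenization as normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical utterances, invented for this example.
adult_wer = word_error_rate(
    "can you tell me about your friends at school",
    "can you tell me about your friend at school",
)
child_wer = word_error_rate("i want the red car", "i want red car")

# Absolute gap is a difference in percentage points; relative gap is a ratio.
absolute_gap_points = 100 * (child_wer - adult_wer)
relative_gap_percent = 100 * (child_wer - adult_wer) / adult_wer

print(f"adult WER: {100 * adult_wer:.1f}%")
print(f"child WER: {100 * child_wer:.1f}%")
print(f"absolute gap: {absolute_gap_points:.1f} points")
print(f"relative gap: {relative_gap_percent:.1f}%")
```

This sketch scores single utterances. Corpus-level WER is usually computed by summing errors and reference words over all utterances before dividing, which gives a different number from the mean of per-utterance WERs.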

Reliable transcription of child-adult conversations in clinical settings is crucial for diagnosing developmental disorders like autism. Recent advances in deep learning and the availability of large-scale transcribed data have led to the development of speech foundation models that have shown dramatic improvements in ASR performance. However, their performance on conversational child-adult interactions remains underexplored. In this work, we provide a comprehensive evaluation of ASR performance on a dataset containing child-adult interactions from autism diagnostic sessions, using Whisper, Wav2Vec2, HuBERT, and WavLM. We find that speech foundation models show a noticeable performance drop (15-20% absolute WER) for child speech compared to adult speech in the conversational setting. Then, we fine-tune the best-performing zero-shot model (Whisper-large) using LoRA in a low-resource setting, yielding 8% and 13% absolute WER improvements for child and adult speech, respectively.
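For readers unfamiliar with LoRA, the general low-rank adaptation formulation is sketched below. This is the standard form of the method, not a description of this paper's configuration: the abstract does not state the rank, the scaling factor, or which Whisper weight matrices were adapted, so $r$, $\alpha$, and the choice of $W_0$ are left symbolic.

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

The pretrained weight $W_0$ stays frozen and only $A$ and $B$ are trained, so each adapted matrix contributes $r(d + k)$ trainable parameters instead of $dk$. That reduction is what makes fine-tuning a large model practical in a low-resource setting.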
