AS CL SD
Sep 12, 2023

Kid-Whisper: Towards Bridging the Performance Gap in Automatic Speech Recognition for Children VS. Adults

arXiv:2309.07927v3
43 citations
h-index: 15
Originality: Synthesis-oriented
AI Analysis

This paper addresses the limited performance of ASR on children's speech. Its contribution is an incremental improvement over prior work.

This paper tackles the performance gap in automatic speech recognition for children versus adults by enhancing data preprocessing on the MyST children's speech corpus, reducing Word Error Rate from 13.93% to 9.11% with Whisper-Small and from 13.23% to 8.61% with Whisper-Medium, and showing generalization to unseen datasets.

Recent advancements in Automatic Speech Recognition (ASR) systems, exemplified by Whisper, have demonstrated the potential of these systems to approach human-level performance given sufficient data. However, this progress doesn't readily extend to ASR for children due to the limited availability of suitable child-specific databases and the distinct characteristics of children's speech. A recent study investigated leveraging the My Science Tutor (MyST) children's speech corpus to enhance Whisper's performance in recognizing children's speech. They were able to demonstrate some improvement on a limited test set. This paper builds on these findings by enhancing the utility of the MyST dataset through more efficient data preprocessing. We reduce the Word Error Rate (WER) on the MyST test set from 13.93% to 9.11% with Whisper-Small and from 13.23% to 8.61% with Whisper-Medium and show that this improvement can be generalized to unseen datasets. We also highlight important challenges towards improving children's ASR performance. The results showcase the viable and efficient integration of Whisper for effective children's speech recognition.
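
For readers unfamiliar with the metric, the sketch below shows how a word error rate is conventionally computed (word-level edit distance divided by the number of reference words) and what the reported numbers amount to as a relative reduction. It is a minimal illustration, not the paper's evaluation code: the function names `word_error_rate` and `relative_reduction` are hypothetical, the example sentences are made up, and any text normalization the authors apply before scoring is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error removed."""
    return (before - after) / before


# Made-up example: one substitution and one deletion over six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Relative WER reductions implied by the figures quoted in the abstract.
small_gain = relative_reduction(13.93, 9.11)   # Whisper-Small
medium_gain = relative_reduction(13.23, 8.61)  # Whisper-Medium
```

Under these definitions, the quoted figures correspond to a relative WER reduction of roughly 34.6% for Whisper-Small and 34.9% for Whisper-Medium.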
