Examining Test-Time Adaptation for Personalized Child Speech Recognition
This work addresses domain shift in child speech recognition, but its contribution is incremental: it systematically applies existing test-time adaptation (TTA) methods to a new context.
The study tackles performance degradation in automatic speech recognition for child speakers by applying test-time adaptation methods. These methods significantly improved both off-the-shelf and fine-tuned models over unadapted baselines, though they remained limited on non-linguistic speech.
Automatic speech recognition (ASR) models often experience performance degradation due to data domain shifts introduced at test time, a challenge that is further amplified for child speakers. Test-time adaptation (TTA) methods have shown great potential in bridging this domain gap. However, the use of TTA to adapt ASR models to the individual differences in each child's speech has not yet been systematically studied. In this work, we investigate the effectiveness of two widely used TTA methods, SUTA and SGEM, in adapting off-the-shelf ASR models and their fine-tuned versions for child speech recognition, with the goal of enabling continuous, unsupervised adaptation at test time. Our findings show that TTA significantly improves the performance of both off-the-shelf and fine-tuned ASR models, both on average and across individual child speakers, compared to unadapted baselines. However, while TTA helps adapt to individual variability, it may still be limited for non-linguistic child speech.
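To make "unsupervised adaptation at test time" concrete, the sketch below illustrates entropy minimization, a common TTA objective: without any labels, a few parameters are nudged so that the model's per-frame predictions on a single utterance become more confident. This is a toy illustration under stated assumptions, not the SUTA or SGEM implementation evaluated in this work. The fixed `frames` stand in for the per-frame outputs of a frozen ASR model, and the per-class scale `gamma` and shift `beta` stand in for the small set of parameters a TTA method would update; all names here are hypothetical.

```python
import math

def softmax(z):
    # Numerically stable softmax over one frame's logits.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def entropy(p):
    # Shannon entropy (in nats) of one predictive distribution.
    return -sum(q * math.log(q) for q in p if q > 0.0)

def mean_entropy(frames, gamma, beta):
    # Average per-frame entropy after the affine transform z = gamma * x + beta.
    total = 0.0
    for x in frames:
        z = [g * v + b for g, v, b in zip(gamma, x, beta)]
        total += entropy(softmax(z))
    return total / len(frames)

def adapt(frames, steps=5, lr=0.1):
    # Hypothetical adaptable parameters: one scale and one shift per class.
    n = len(frames[0])
    gamma, beta = [1.0] * n, [0.0] * n
    history = [mean_entropy(frames, gamma, beta)]
    for _ in range(steps):
        g_gamma, g_beta = [0.0] * n, [0.0] * n
        for x in frames:
            z = [g * v + b for g, v, b in zip(gamma, x, beta)]
            p = softmax(z)
            h = entropy(p)
            for j in range(n):
                # Analytic gradient of entropy w.r.t. logit j: -p_j * (log p_j + H).
                dz = -p[j] * (math.log(max(p[j], 1e-12)) + h)
                g_gamma[j] += dz * x[j] / len(frames)
                g_beta[j] += dz / len(frames)
        # Plain gradient descent on the mean entropy; no labels are used.
        gamma = [g - lr * d for g, d in zip(gamma, g_gamma)]
        beta = [b - lr * d for b, d in zip(beta, g_beta)]
        history.append(mean_entropy(frames, gamma, beta))
    return gamma, beta, history

# Toy "utterance": three frames of logits over three output classes.
frames = [[2.0, 1.0, 0.1], [0.5, 1.5, 0.3], [0.2, 0.1, 1.0]]
gamma, beta, history = adapt(frames)
print("mean entropy before:", history[0])
print("mean entropy after: ", history[-1])
```

In a real system the gradient would flow through selected layers of the ASR model rather than a toy affine transform, and lower entropy does not by itself guarantee a lower word error rate, which is consistent with the limitation noted above for non-linguistic speech.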