Revealing the Role of Audio Channels in ASR Performance Degradation
This work addresses a critical issue for ASR systems deployed in real-world applications where audio quality varies, though the contribution is incremental because it builds on the known channel-mismatch problem.
The study tackles ASR performance degradation caused by differing recording channels. It proposes a normalization technique that aligns internal feature representations with those of a clean reference channel, which yields significant improvements on unseen channels and languages.
Pre-trained automatic speech recognition (ASR) models have demonstrated strong performance on a variety of tasks. However, their performance can degrade substantially when the input audio comes from different recording channels. While previous studies have demonstrated this phenomenon, it is often attributed to the mismatch between training and testing corpora. This study argues that variations in speech characteristics caused by different recording channels can fundamentally harm ASR performance. To address this limitation, we propose a normalization technique designed to mitigate the impact of channel variation by aligning internal feature representations in the ASR model with those derived from a clean reference channel. This approach significantly improves ASR performance on previously unseen channels and languages, highlighting its ability to generalize across channel and language differences.
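The abstract does not say what form the normalization takes, so the following is only a minimal sketch of one plausible reading: matching the per-dimension mean and standard deviation of an utterance's hidden features to statistics collected from a clean reference channel. The function names (`channel_stats`, `align_to_reference`), the `(frames, dim)` feature layout, and the choice of first- and second-order statistics are all assumptions made for illustration, not the authors' method.

```python
import numpy as np


def channel_stats(features, eps=1e-5):
    """Per-dimension mean and std of a (frames, dim) feature matrix.

    Hypothetical helper: `eps` keeps the std strictly positive.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0) + eps
    return mean, std


def align_to_reference(features, ref_mean, ref_std, eps=1e-5):
    """Shift and scale `features` so that their per-dimension statistics
    match those of a clean reference channel (assumed form of alignment).
    """
    mean, std = channel_stats(features, eps)
    standardized = (features - mean) / std   # remove the input channel's statistics
    return standardized * ref_std + ref_mean  # impose the reference channel's statistics


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for hidden features taken from some ASR encoder layer.
    clean = rng.normal(loc=0.0, scale=1.0, size=(500, 8))
    # Simulate a channel effect as a per-dimension gain and offset (an assumption).
    noisy = clean * 2.5 + 1.0

    ref_mean, ref_std = channel_stats(clean)
    aligned = align_to_reference(noisy, ref_mean, ref_std)
    print(np.abs(aligned.mean(axis=0) - ref_mean).max())
```

In this sketch the reference statistics would be computed once from clean-channel audio and reused at inference time, so no retraining of the ASR model is implied. Whether the paper applies the alignment per utterance, per layer, or with richer statistics is not stated in the abstract.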