AS LG Sep 23, 2021

ChannelAugment: Improving generalization of multi-channel ASR by training with input channel randomization

arXiv:2109.11225v1
Originality: Incremental advance
AI Analysis

This work addresses a practical deployment challenge for far-field ASR systems by improving robustness to microphone array variations, though it is an incremental improvement over existing augmentation methods.

The paper addresses the loss of accuracy that multi-channel ASR systems suffer when tested with a microphone array geometry different from the one used in training. It proposes ChannelAugment, a data augmentation technique that randomly drops input channels during training. The reported results are a 10.6% WER improvement for Spatial Filtering (SF) across array configurations with different numbers of microphones, and a 74% reduction in training time for MVDR beamforming without loss of recognition accuracy.

End-to-end (E2E) multi-channel ASR systems show state-of-the-art performance in far-field ASR tasks by joint training of a multi-channel front-end along with the ASR model. The main limitation of such systems is that they are usually trained with data from a fixed array geometry, which can lead to degradation in accuracy when a different array is used in testing. This makes it challenging to deploy these systems in practice, as it is costly to retrain and deploy different models for various array configurations. To address this, we present a simple and effective data augmentation technique, which is based on randomly dropping channels in the multi-channel audio input during training, in order to improve the robustness to various array configurations at test time. We call this technique ChannelAugment, in contrast to SpecAugment (SA) which drops time and/or frequency components of a single channel input audio. We apply ChannelAugment to the Spatial Filtering (SF) and Minimum Variance Distortionless Response (MVDR) neural beamforming approaches. For SF, we observe 10.6% WER improvement across various array configurations employing different numbers of microphones. For MVDR, we achieve a 74% reduction in training time without causing degradation of recognition accuracy.
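The channel-dropping idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `channel_augment`, the `min_keep` and `zero_fill` parameters, and the uniform sampling of how many channels to keep are all assumptions made for illustration. The abstract only says that channels are randomly dropped during training; whether dropped channels are removed or zeroed, and how the number of dropped channels is chosen, is not stated there.

```python
import random


def channel_augment(audio, min_keep=1, rng=None, zero_fill=False):
    """Randomly drop channels from one multi-channel training example.

    audio:     list of channels, each a list of samples (hypothetical layout).
    min_keep:  smallest number of channels to keep (assumed parameter).
    zero_fill: if True, dropped channels are replaced by zeros so the
               channel count is unchanged; if False, they are removed.
    Returns (augmented_audio, kept_channel_indices).
    """
    if rng is None:
        rng = random.Random()
    num_channels = len(audio)
    if not 1 <= min_keep <= num_channels:
        raise ValueError("min_keep must be between 1 and the number of channels")

    # Assumption: the number of kept channels is drawn uniformly
    # from [min_keep, num_channels], both ends inclusive.
    num_keep = rng.randint(min_keep, num_channels)
    # Pick which channels survive, preserving their original order.
    keep = sorted(rng.sample(range(num_channels), num_keep))

    if zero_fill:
        kept = set(keep)
        out = [
            list(ch) if i in kept else [0.0] * len(ch)
            for i, ch in enumerate(audio)
        ]
        return out, keep
    return [list(audio[i]) for i in keep], keep


if __name__ == "__main__":
    # Six-channel toy example with four samples per channel.
    example = [[float(c)] * 4 for c in range(6)]
    augmented, kept = channel_augment(example, min_keep=2, rng=random.Random(0))
    print("kept channels:", kept, "-> output has", len(augmented), "channels")
```

Applied independently to each training example, a transform of this kind exposes the model to many channel subsets of the training array, which is the mechanism the abstract credits for robustness to different array configurations at test time.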

