Attention-based multi-channel speaker verification with ad-hoc microphone arrays
This work addresses the problem of robust speaker verification with unpredictable ad-hoc microphone setups, offering a practical solution for applications such as smart devices. The contribution is incremental, however, because it builds on existing single-channel systems and extends them with novel adaptations.
The paper tackles speaker verification with ad-hoc microphone arrays, where the microphone arrangement is not known in advance. It proposes an attention-based multi-channel method that uses residual self-attention and sparsemax to weight channels, achieving an equal error rate over 20% lower on semi-real data and over 30% lower on simulation data than an oracle one-best system.
Recently, ad-hoc microphone arrays have been widely studied. Unlike traditional microphone arrays, the spatial arrangement and number of microphones in an ad-hoc microphone array are not known in advance, which hinders the adaptation of traditional speaker verification technologies to ad-hoc microphone arrays. To address this problem, we propose attention-based multi-channel speaker verification with ad-hoc microphone arrays. Specifically, we add an inter-channel processing layer and a global fusion layer after the pooling layer of a single-channel speaker verification system. The inter-channel processing layer applies a so-called residual self-attention along the channel dimension to allocate weights to different microphones. The global fusion layer integrates all channels in a way that is independent of the number of input channels. We further replace the softmax operator in the residual self-attention with sparsemax, which forces the channel weights of very noisy channels to zero. Experimental results with ad-hoc microphone arrays of over 30 channels demonstrate the effectiveness of the proposed methods. For example, multi-channel speaker verification with sparsemax achieves an equal error rate (EER) that is over 20% lower than that of the oracle one-best system on semi-real data sets, and over 30% lower on simulation data sets, in test scenarios with both matched and mismatched channel numbers.
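The channel-weighting idea in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. Several pieces are assumptions made only for illustration: the function names (`sparsemax`, `residual_channel_attention`, `global_fusion`) are hypothetical, the query and key projections are taken to be the identity rather than learned, the scores are scaled by the square root of the embedding dimension, and the global fusion is a plain mean over channels. The sparsemax itself follows the standard definition as a Euclidean projection of the scores onto the probability simplex.

```python
import math


def sparsemax(scores):
    """Project a list of scores onto the probability simplex.

    Unlike softmax, the result can contain exact zeros.
    """
    ordered = sorted(scores, reverse=True)
    running = 0.0
    support_size, support_sum = 0, 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        # The support is the largest k for which this condition holds.
        if 1.0 + k * value > running:
            support_size, support_sum = k, running
    threshold = (support_sum - 1.0) / support_size
    return [max(s - threshold, 0.0) for s in scores]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residual_channel_attention(channels):
    """Self-attention along the channel axis with a residual connection.

    channels: list of C utterance-level embeddings, each a list of D floats.
    Assumption: identity query/key/value projections (a real system would
    learn them).
    """
    dim = len(channels[0])
    scale = math.sqrt(dim)
    # One row of channel weights per query channel.
    weights = [
        sparsemax([dot(query, key) / scale for key in channels])
        for query in channels
    ]
    outputs = []
    for query, row in zip(channels, weights):
        attended = [
            sum(w * ch[d] for w, ch in zip(row, channels)) for d in range(dim)
        ]
        # Residual connection: add the attended vector back to the input.
        outputs.append([x + a for x, a in zip(query, attended)])
    return outputs, weights


def global_fusion(channels):
    """Assumed fusion: mean over channels, valid for any channel count."""
    count = len(channels)
    return [sum(column) / count for column in zip(*channels)]


# Toy input: two similar "clean" channels and one dissimilar "noisy" channel.
embeddings = [
    [1.0, 0.0, 1.0, 0.0],
    [0.9, 0.1, 1.1, 0.0],
    [-1.0, 2.0, -1.0, 0.5],
]
attended, weights = residual_channel_attention(embeddings)
fused = global_fusion(attended)
print(weights[0])
print(fused)
```

In this toy input, the attention row of the first channel assigns zero weight to the third channel, which is the behavior the abstract attributes to sparsemax for very noisy channels. Because the fusion step averages over whatever number of channels it receives, the fused embedding has the same dimension regardless of how many microphones are present.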