SD · AS · Jul 1, 2021

Attention-based multi-channel speaker verification with ad-hoc microphone arrays

arXiv:2107.00178v1
Originality: Incremental advance
AI Analysis

This paper addresses robust speaker verification with ad-hoc microphone arrays, whose layout and channel count are not known in advance, and offers a practical solution for applications such as smart devices. The advance is incremental: it extends an existing single-channel system with new channel-attention and fusion layers rather than proposing a new architecture from scratch.

The paper tackles speaker verification with ad-hoc microphone arrays, where the arrangement and number of microphones are unknown. It proposes an attention-based multi-channel method that weights channels with residual self-attention and sparsemax. Compared with an oracle one-best system, the method achieves an equal error rate (EER) over 20% lower on semi-real data and over 30% lower on simulation data.
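For readers unfamiliar with the metric, the following is a minimal sketch of how an equal error rate can be estimated from verification scores. It is not taken from the paper: the function name `equal_error_rate`, the simple threshold sweep, and the toy scores are all illustrative assumptions.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    Hypothetical helper for illustration; real toolkits usually
    interpolate the ROC curve instead of using this coarse sweep.
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # False acceptance rate: impostor trials scored at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    # False rejection rate: genuine trials scored below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # The EER is the operating point where the two error rates meet.
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

# Toy scores (made up for this example).
eer = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6])
print(eer)  # 0.25
```

A lower EER means fewer errors at the point where false acceptances and false rejections are balanced.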

Recently, ad-hoc microphone arrays have been widely studied. Unlike traditional microphone array settings, the spatial arrangement and number of microphones of ad-hoc microphone arrays are not known in advance, which hinders the adaptation of traditional speaker verification technologies to ad-hoc microphone arrays. To overcome this weakness, in this paper, we propose attention-based multi-channel speaker verification with ad-hoc microphone arrays. Specifically, we add an inter-channel processing layer and a global fusion layer after the pooling layer of a single-channel speaker verification system. The inter-channel processing layer applies a so-called residual self-attention along the channel dimension to allocate weights to different microphones. The global fusion layer integrates all channels in a way that is independent of the number of input channels. We further replace the softmax operator in the residual self-attention with sparsemax, which forces the channel weights of very noisy channels to zero. Experimental results with ad-hoc microphone arrays of over 30 channels demonstrate the effectiveness of the proposed methods. For example, the multi-channel speaker verification with sparsemax achieves an equal error rate (EER) over 20% lower than that of the oracle one-best system on semi-real data sets, and over 30% lower on simulation data sets, in test scenarios with both matched and mismatched channel numbers.
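The pipeline described in the abstract can be sketched roughly as follows. This is a NumPy illustration under stated assumptions, not the authors' implementation: the abstract does not give the layer equations, so the scaled dot-product form, the projection matrices `Wq`, `Wk`, `Wv`, the mean-pooling fusion, and the function names `residual_channel_attention` and `global_fusion` are all hypothetical. Only `sparsemax` follows a published definition (sort, find the support size, subtract a threshold, clip at zero).

```python
import numpy as np

def sparsemax(z, axis=-1):
    """Project scores onto the probability simplex; unlike softmax,
    low-scoring entries can receive a weight of exactly zero."""
    z = np.asarray(z, dtype=float)
    z_sorted = -np.sort(-z, axis=axis)           # descending order
    cumsum = np.cumsum(z_sorted, axis=axis)
    shape = [1] * z.ndim
    shape[axis] = -1
    k = np.arange(1, z.shape[axis] + 1).reshape(shape)
    support = 1 + k * z_sorted > cumsum          # true for the first k(z) entries
    k_z = support.sum(axis=axis, keepdims=True)  # support size, always >= 1
    tau = (np.take_along_axis(cumsum, k_z - 1, axis=axis) - 1) / k_z
    return np.maximum(z - tau, 0.0)

def residual_channel_attention(X, Wq, Wk, Wv, weight_fn=sparsemax):
    """Self-attention along the channel axis with a residual connection.

    X has shape (channels, dim): one pooled embedding per microphone.
    Assumed form: scaled dot-product attention (not confirmed by the source).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    A = weight_fn(scores, axis=-1)               # (channels, channels) weights
    return X + A @ V, A

def global_fusion(H):
    """Assumed fusion: average over channels, so the output size does not
    depend on how many microphones were present."""
    return H.mean(axis=0)

print(sparsemax(np.array([2.0, 1.0, -1.0])))     # [1. 0. 0.]

rng = np.random.default_rng(0)
dim = 8
Wq, Wk, Wv = (rng.standard_normal((dim, dim)) for _ in range(3))
for n_channels in (5, 31):                       # mismatched channel counts
    X = rng.standard_normal((n_channels, dim))
    H, A = residual_channel_attention(X, Wq, Wk, Wv)
    print(n_channels, global_fusion(H).shape)    # fused embedding is always (8,)
```

Passing a softmax function as `weight_fn` would give the dense variant, in which every channel keeps a nonzero weight; sparsemax is what allows a very noisy channel to be dropped entirely.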

