SD · AS · Oct 12, 2021

Multi-Channel Far-Field Speaker Verification with Large-Scale Ad-hoc Microphone Arrays

arXiv:2110.05975v3
Originality: Incremental advance
AI Analysis

This work addresses speaker verification in noisy settings. It is incremental: it builds on existing ad-hoc microphone array methods by improving how information is aggregated across channels.

The paper tackles speaker verification in adverse acoustic environments using ad-hoc microphone arrays. It proposes frame-level aggregation with attention mechanisms to make better use of spatial-temporal information. The proposed methods reach state-of-the-art performance, and the graph-attention method outperforms the self-attention method in most cases.

Speaker verification based on ad-hoc microphone arrays has the potential to reduce the error significantly in adverse acoustic environments. However, existing approaches extract utterance-level speaker embeddings from each channel of an ad-hoc microphone array, which does not fully consider the spatial-temporal information across the devices. In this paper, we propose to aggregate the multichannel signals of the ad-hoc microphone array at the frame level by exploring the cross-channel information deeply with two attention mechanisms. The first one is a self-attention method. It consists of a cross-frame self-attention layer followed by a cross-channel self-attention layer, both working at the frame level. The second one learns the cross-frame and cross-channel information via two graph attention layers. Experimental results demonstrate that the proposed methods reach state-of-the-art performance. Moreover, the graph-attention method is better than the self-attention method in most cases.
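To make the first mechanism concrete, here is a minimal NumPy sketch of frame-level aggregation with a cross-frame self-attention step followed by a cross-channel self-attention step. It is an illustration under stated assumptions, not the authors' implementation: the function names, the identity query/key/value projections, the single attention head, the mean pooling at the end, and the tensor sizes are all hypothetical choices made for brevity. The graph-attention variant is not shown.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    """Scaled dot-product self-attention over the second-to-last axis.

    x has shape (..., N, D). Assumption: queries, keys and values are the
    inputs themselves (no learned projections), to keep the sketch short.
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)  # (..., N, N)
    return softmax(scores, axis=-1) @ x               # (..., N, D)


def frame_level_aggregate(feats):
    """Aggregate frame-level features of shape (C, T, D) into one (D,) vector.

    C = channels (devices), T = frames, D = feature dimension.
    """
    # Cross-frame step: within each channel, every frame attends to all frames.
    h = self_attention(feats)        # (C, T, D)
    # Cross-channel step: at each frame, every channel attends to all channels.
    h = np.swapaxes(h, 0, 1)         # (T, C, D)
    h = self_attention(h)            # (T, C, D)
    # Assumption: mean pooling over channels, then over frames.
    return h.mean(axis=1).mean(axis=0)


rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 50, 16))  # 4 channels, 50 frames, 16 dims
emb = frame_level_aggregate(feats)
print(emb.shape)
```

In this sketch the result does not depend on the order of the channels, because the cross-channel step uses no positional information and the final pooling is a mean. That property is convenient for ad-hoc arrays, where the number and arrangement of devices are not fixed in advance.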

