SD AS · Dec 23, 2021

Graph attentive feature aggregation for text-independent speaker verification

arXiv:2112.12343v1 · 18 citations
Originality: Incremental advance
AI Analysis

This work addresses speaker verification, a domain-specific task in speech processing, with incremental improvements to feature aggregation.

The paper tackles the problem of aggregating frame-level features into utterance-level representations for text-independent speaker verification by proposing a graph attentive feature aggregation module that models pairwise relationships directly. The method achieves over 10% relative improvement compared to baselines when integrated with SE-ResNet and RawNet2 systems.

The objective of this paper is to combine multiple frame-level features into a single utterance-level representation while considering pairwise relationships. For this purpose, we propose a novel graph attentive feature aggregation module by interpreting each frame-level feature as a node of a graph. The inter-relationship between all possible pairs of features, typically exploited indirectly, can be directly modeled using a graph. The module comprises a graph attention layer and a graph pooling layer followed by a readout operation. The graph attention layer first models the non-Euclidean data manifold between different nodes. Then, the graph pooling layer discards less informative nodes according to their significance. Finally, the readout operation combines the remaining nodes into a single representation. We employ two recent systems, SE-ResNet and RawNet2, with different input features and architectures and demonstrate that the proposed feature aggregation module consistently yields a relative improvement of over 10% compared to the baseline.
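
The three stages described in the abstract (graph attention, graph pooling, readout) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the attention scoring (element-wise product of node pairs followed by a learned vector), the top-k pooling with a sigmoid gate, and the mean readout are simplified stand-ins, and every function name, parameter name, and shape below is hypothetical. Weights are random here, whereas a real system would learn them.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def graph_attention(nodes, w_att, w_proj):
    """Attend over a fully connected graph of frame-level features.

    nodes: (N, D), w_att: (D,), w_proj: (D, D_out). All names are assumptions.
    """
    # Pairwise interaction of every ordered node pair (i, j): shape (N, N, D).
    pair = nodes[:, None, :] * nodes[None, :, :]
    # One scalar attention logit per pair: shape (N, N).
    logits = np.tanh(pair) @ w_att
    # Normalize over neighbours j so that each row sums to 1.
    alpha = softmax(logits, axis=1)
    # Weighted sum of neighbours, then a linear projection: shape (N, D_out).
    return (alpha @ nodes) @ w_proj


def graph_pool(nodes, w_score, keep_ratio=0.5):
    """Keep the highest-scoring nodes (assumed top-k pooling with a gate)."""
    scores = nodes @ w_score / np.linalg.norm(w_score)
    k = max(1, int(np.ceil(keep_ratio * nodes.shape[0])))
    top = np.argsort(scores)[::-1][:k]
    # Scale the kept nodes by a sigmoid of their score.
    gate = 1.0 / (1.0 + np.exp(-scores[top]))
    return nodes[top] * gate[:, None]


def readout(nodes):
    # Assumed readout: average the remaining nodes into one vector.
    return nodes.mean(axis=0)


def aggregate(frames, params, keep_ratio=0.5):
    h = graph_attention(frames, params["w_att"], params["w_proj"])
    h = graph_pool(h, params["w_score"], keep_ratio)
    return readout(h)


# Usage: 20 frame-level features of dimension 8, random stand-in weights.
rng = np.random.default_rng(0)
frames = rng.standard_normal((20, 8))
params = {
    "w_att": rng.standard_normal(8),
    "w_proj": rng.standard_normal((8, 16)),
    "w_score": rng.standard_normal(16),
}
embedding = aggregate(frames, params)
print(embedding.shape)  # (16,)
```

The pairwise tensor has shape (N, N, D), so memory grows quadratically with the number of frames; that cost is the price of modeling every pair of frame-level features directly rather than indirectly.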
