AS LG MM SD
Feb 26, 2019

Utterance-level Aggregation For Speaker Recognition In The Wild

Weidi Xie, Arsha Nagrani, Joon Son Chung, Andrew Zisserman

arXiv:1902.10107v2
368 citations

Originality: Incremental advance

AI Analysis

This paper addresses speaker recognition for applications in noisy, unconstrained environments and represents a strong incremental improvement over prior methods.

The paper tackles speaker recognition in challenging real-world conditions, where utterances vary in length and may contain irrelevant signals. It proposes a deep network with a thin-ResNet trunk and a dictionary-based temporal aggregation layer (NetVLAD or GhostVLAD), and reports state-of-the-art performance on the VoxCeleb1 test set with fewer parameters than previous methods.

The objective of this paper is speaker recognition "in the wild", where utterances may be of variable length and may also contain irrelevant signals. Crucial elements in the design of deep networks for this task are the type of trunk (frame-level) network and the method of temporal aggregation. We propose a powerful speaker recognition deep network, using a "thin-ResNet" trunk architecture and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time, that can be trained end-to-end. We show that our network achieves state-of-the-art performance by a significant margin on the VoxCeleb1 test set for speaker recognition, whilst requiring fewer parameters than previous methods. We also investigate the effect of utterance length on performance, and conclude that for "in the wild" data, a longer length is beneficial.
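To make the aggregation step concrete, here is a minimal sketch of NetVLAD-style pooling with optional "ghost" clusters, written in plain Python so that it is self-contained. It is an illustration under stated assumptions, not the authors' implementation: the function names, the argument layout, and the fixed parameter values are hypothetical, and in the paper the assignment weights and cluster centres would be learned end-to-end together with the trunk network, which is not shown here. Frame-level descriptors are softly assigned to clusters, residuals to each cluster centre are summed over time, and the ghost clusters take part in the soft assignment but are dropped from the output.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ghostvlad(frames, weights, biases, centers, num_ghost=0):
    """Aggregate T frame descriptors (T x D) into one K*D vector.

    weights, centers: (K + G) x D; biases: K + G (all hypothetical,
    assumed to be learned elsewhere). The last num_ghost clusters are
    ghosts. With num_ghost=0 this reduces to plain NetVLAD pooling.
    """
    num_real = len(centers) - num_ghost
    dim = len(centers[0])
    residual_sums = [[0.0] * dim for _ in range(num_real)]

    for x in frames:
        # Soft assignment over ALL clusters, ghosts included.
        scores = [
            sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            for w, b in zip(weights, biases)
        ]
        assign = softmax(scores)
        # Accumulate weighted residuals for the real clusters only.
        for k in range(num_real):
            for j in range(dim):
                residual_sums[k][j] += assign[k] * (x[j] - centers[k][j])

    # Intra-normalize each cluster's residual, flatten, normalize again.
    per_cluster = [l2_normalize(v) for v in residual_sums]
    flat = [v for cluster in per_cluster for v in cluster]
    return l2_normalize(flat)
```

Because the residuals are summed over time, the output length is K*D however many frames the utterance has, which is what lets one network handle variable-length input.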
