SD LG AS, Jul 23, 2020

Augmentation adversarial training for self-supervised speaker recognition

arXiv:2007.12085v3, 79 citations
AI Analysis

This work addresses the challenge of speaker recognition without labeled data, offering a novel approach to enhance robustness in self-supervised learning, though it is incremental in improving existing contrastive methods.

The paper tackles the problem of training robust speaker recognition models without speaker labels by addressing the difficulty of separating speaker from channel information in contrastive learning. It proposes an augmentation adversarial training strategy that improves performance, achieving significant gains over previous self-supervised methods and exceeding human performance on VoxCeleb and VOiCES datasets.

The goal of this work is to train robust speaker recognition models without speaker labels. Recent works on unsupervised speaker representations are based on contrastive learning, which encourages within-utterance embeddings to be similar and across-utterance embeddings to be dissimilar. However, since the within-utterance segments share the same acoustic characteristics, it is difficult to separate the speaker information from the channel information. To this end, we propose an augmentation adversarial training strategy that trains the network to be discriminative for the speaker information while remaining invariant to the augmentation applied. Since the augmentation simulates the acoustic characteristics, training the network to be invariant to augmentation also encourages the network to be invariant to the channel information in general. Extensive experiments on the VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision, and the performance of our self-supervised models far exceeds that of humans.
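The two ingredients described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the paper's implementation: the InfoNCE-style form of the contrastive loss, the `temperature` value, the cross-entropy augmentation classifier, and the helper names (`contrastive_loss`, `encoder_objective`, `weight`) are all assumptions chosen for illustration. The sketch only shows how a within-utterance contrastive term and an adversarial augmentation term might be combined from the encoder's point of view.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def contrastive_loss(anchors, positives, temperature=0.1):
    """Assumed InfoNCE-style loss.

    anchors[i] and positives[i] are embeddings of two segments cut from
    utterance i; segments from other utterances act as negatives.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [cosine(anchor, p) / temperature for p in positives]
        total -= log_softmax(scores)[i]
    return total / len(anchors)


def cross_entropy(logits, label):
    """Loss of a (hypothetical) classifier that predicts the augmentation type."""
    return -log_softmax(logits)[label]


def encoder_objective(speaker_loss, augmentation_loss, weight=1.0):
    """Objective seen by the encoder (assumed form).

    The encoder minimises the speaker loss while maximising the augmentation
    classifier's loss, which pushes the embeddings to carry no information
    about the augmentation that was applied.
    """
    return speaker_loss - weight * augmentation_loss
```

In a real training loop the sign flip in `encoder_objective` would typically be realised with a gradient reversal layer or alternating updates, so that the augmentation classifier itself still minimises its own loss; that machinery is omitted here.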
