AS SD, Aug 9, 2020

Cosine-Distance Virtual Adversarial Training for Semi-Supervised Speaker-Discriminative Acoustic Embeddings

arXiv:2008.03756v1, 7 citations
Originality: Incremental advance
AI Analysis

The paper addresses the difficulty of obtaining large labeled datasets for speaker recognition, especially under privacy constraints, by leveraging unlabeled data that need not come from the same speakers as the labeled data.

The paper tackles the problem of training speaker-discriminative acoustic embeddings with limited labeled data. It proposes a semi-supervised learning technique called cosine-distance virtual adversarial training (CD-VAT), which on a VoxCeleb speaker verification task reduces the equal error rate by 11.1% relative to a purely supervised baseline.

In this paper, we propose a semi-supervised learning (SSL) technique for training deep neural networks (DNNs) to generate speaker-discriminative acoustic embeddings (speaker embeddings). Obtaining large amounts of speaker recognition training data can be difficult for desired target domains, especially under privacy constraints. The proposed technique reduces requirements for labelled data by leveraging unlabelled data. The technique is a variant of virtual adversarial training (VAT) [1] in the form of a loss that is defined as the robustness of the speaker embedding against input perturbations, as measured by the cosine-distance. Thus, we term the technique cosine-distance virtual adversarial training (CD-VAT). In comparison to many existing SSL techniques, the unlabelled data does not have to come from the same set of classes (here speakers) as the labelled data. The effectiveness of CD-VAT is shown on the 2750+ hour VoxCeleb data set, where on a speaker verification task it achieves a reduction in equal error rate (EER) of 11.1% relative to a purely supervised baseline. This is 32.5% of the improvement that would be achieved from supervised training if the speaker labels for the unlabelled data were available.
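The loss described in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulation, so this follows the usual VAT recipe: take a random direction, refine it with one power-iteration step towards the perturbation that moves the embedding most, then report the cosine distance between the clean and perturbed embeddings. Everything concrete here is an assumption made for illustration: the one-layer `embed` network, the weights `W`, the hyperparameters `eps` and `xi`, and the finite-difference gradient, which stands in for the backpropagation a real DNN would use.

```python
import math
import random

def embed(x, W):
    """Toy stand-in for a speaker-embedding network (hypothetical): e = tanh(W x)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

def cosine_distance(a, b):
    """1 - cos(a, b): 0 when two embeddings point in the same direction."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return 1.0 - dot / (norm_a * norm_b)

def l2_normalise(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def cd_vat_loss(x, W, eps=0.5, xi=0.1, power_iters=1, h=1e-5, rng=None):
    """Cosine distance between the clean embedding and the embedding of an
    approximately worst-case perturbed input with ||r_adv|| = eps.
    No speaker label is used anywhere, so x can be unlabelled data."""
    rng = rng or random.Random(0)
    clean = embed(x, W)  # clean embedding, treated as a fixed target

    def dist(r):
        return cosine_distance(clean, embed([a + b for a, b in zip(x, r)], W))

    # Start from a random unit direction.
    d = l2_normalise([rng.gauss(0.0, 1.0) for _ in x])
    for _ in range(power_iters):
        r = [xi * c for c in d]
        # Gradient of the distance w.r.t. the perturbation at r = xi * d,
        # estimated by central differences (assumption: autodiff in practice).
        grad = []
        for i in range(len(x)):
            r_plus, r_minus = list(r), list(r)
            r_plus[i] += h
            r_minus[i] -= h
            grad.append((dist(r_plus) - dist(r_minus)) / (2 * h))
        d = l2_normalise(grad)

    r_adv = [eps * c for c in d]
    return dist(r_adv), r_adv

# Hypothetical 4-dimensional "acoustic feature" and 3-dimensional embedding.
W_demo = [[0.5, -0.2, 0.1, 0.3],
          [-0.4, 0.6, 0.2, -0.1],
          [0.3, 0.1, -0.5, 0.2]]
x_demo = [1.0, 0.5, -0.3, 0.8]
loss, r_adv = cd_vat_loss(x_demo, W_demo)
print("CD-VAT loss on one unlabelled example:", loss)
```

In a training loop, a term of this kind would typically be averaged over a batch of unlabelled utterances and added, with some weight, to the supervised loss on the labelled data. Because it compares embeddings rather than class posteriors, it does not need the unlabelled speakers to match the labelled ones, which is the property the abstract highlights.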

