SD AS Dec 23, 2020

A Principle Solution for Enroll-Test Mismatch in Speaker Recognition

arXiv:2012.12471v2, 4 citations
AI Analysis

This work provides a principled solution for speaker recognition system developers to mitigate performance degradation caused by enrollment-test mismatch, which is a common and significant problem in real-world applications.

This paper addresses the performance degradation in speaker recognition systems due to enrollment-test mismatch by proposing a statistics decomposition (SD) approach. This method decomposes the PLDA score into three components, theoretically yielding an optimal score when correct statistics are applied to each. Experiments on three datasets with various mismatch types demonstrated its high effectiveness, outperforming the commonly used multi-condition training approach.

Mismatch between enrollment and test conditions causes serious performance degradation in speaker recognition systems. This paper presents a statistics decomposition (SD) approach to solve this problem. The approach decomposes the PLDA score into three components that correspond to enrollment, prediction, and normalization, respectively. Provided that the correct statistics are used in each component, the resultant score is theoretically optimal. A comprehensive experimental study was conducted on three datasets with different types of mismatch: (1) physical channel mismatch, (2) speaking behavior mismatch, and (3) near-far recording mismatch. The results demonstrate that the proposed SD approach is highly effective and outperforms the ad-hoc multi-condition training approach, which is commonly adopted but not optimal in theory.
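
The three-component decomposition described above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a simplified diagonal Gaussian model with a zero global mean, a between-speaker variance `eps` shared across conditions, and condition-specific within-speaker variances `sigma_enroll` and `sigma_test`. The function names and all parameter names are hypothetical, chosen only for illustration. Under these assumptions, the score is the log predictive density of the test vector given the enrollment data, minus its log marginal density, with each step using the statistics of its own condition.

```python
import numpy as np


def gaussian_logpdf(x, mean, var):
    """Log density of independent (diagonal) Gaussians, summed over dimensions."""
    return float(np.sum(-0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)))


def sd_plda_score(enroll, test, eps, sigma_enroll, sigma_test):
    """Log-likelihood-ratio score split into enrollment, prediction, normalization.

    Assumed model (a simplification for illustration):
      speaker mean  mu ~ N(0, eps)
      enrollment    x  ~ N(mu, sigma_enroll)
      test          x  ~ N(mu, sigma_test)
    All variances are per-dimension arrays.
    """
    enroll = np.atleast_2d(enroll)
    n = enroll.shape[0]
    xbar = enroll.mean(axis=0)

    # 1) Enrollment: posterior of the speaker mean, using enrollment-condition statistics.
    post_mean = n * eps / (n * eps + sigma_enroll) * xbar
    post_var = eps * sigma_enroll / (n * eps + sigma_enroll)

    # 2) Prediction: predictive density of the test vector, using test-condition statistics.
    log_pred = gaussian_logpdf(test, post_mean, post_var + sigma_test)

    # 3) Normalization: marginal density of the test vector, using test-condition statistics.
    log_norm = gaussian_logpdf(test, 0.0, eps + sigma_test)

    return log_pred - log_norm


# Toy example: clean enrollment (small variance), noisier test condition.
eps = np.array([1.0, 1.0])
sigma_enroll = np.array([0.1, 0.1])
sigma_test = np.array([0.5, 0.5])
enroll = np.array([[1.0, -1.0], [1.2, -0.8]])

same_speaker = sd_plda_score(enroll, np.array([1.1, -0.9]), eps, sigma_enroll, sigma_test)
other_speaker = sd_plda_score(enroll, np.array([-1.0, 1.0]), eps, sigma_enroll, sigma_test)
print(same_speaker > other_speaker)
```

Setting `sigma_enroll` equal to `sigma_test` reduces this sketch to an ordinary matched-condition score under the same simplified model, which is one way to see where mismatched statistics enter.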
