SD · AI · AS · Oct 21, 2021

Optimizing Multi-Taper Features for Deep Speaker Verification

arXiv:2110.10983v1
Originality: Incremental advance
AI Analysis

This work targets speaker verification accuracy in speech processing. It is an incremental advance that optimizes existing multi-taper features for deep-learning-based systems.

The paper optimizes multi-taper estimators for deep speaker verification systems and reports an improvement of up to 25.8% in equal error rate on the SITW corpus over a static-taper design.
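As a rough illustration of the metric, the sketch below computes an equal error rate (EER) from verification scores with a simple threshold sweep, and a relative improvement between two EER values. The function names and all scores are hypothetical, and reading the 25.8% figure as a relative reduction is an assumption, not something stated here.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER via a threshold sweep over all observed scores."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # Miss rate: fraction of target trials scoring below the threshold.
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    # False-alarm rate: fraction of non-target trials at or above the threshold.
    fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    # The EER lies where the two error rates are closest; average them there.
    idx = np.argmin(np.abs(miss - fa))
    return 0.5 * (miss[idx] + fa[idx])

def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction in percent (assumed meaning of 'improvement')."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer

# Made-up scores, for illustration only (not taken from the paper).
eer = equal_error_rate([0.5, 2.0, 3.0, 4.0], [-1.0, 0.0, 1.0, 2.5])
gain = relative_improvement(4.0, 3.0)  # hypothetical baseline and new EER values
```

Production toolkits usually interpolate between thresholds on the DET curve; the sweep above trades that precision for brevity.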

Multi-taper estimators provide low-variance power spectrum estimates that can be used in place of the windowed discrete Fourier transform (DFT) to extract speech features such as mel-frequency cepstral coefficients (MFCCs). Although past work has reported promising automatic speaker verification (ASV) results with Gaussian mixture model-based classifiers, the performance of multi-taper MFCCs with deep ASV systems remains an open question. Instead of a static-taper design, we propose to optimize the multi-taper estimator jointly with a deep neural network trained for ASV tasks. With a maximum improvement of 25.8% in equal error rate on the SITW corpus over the static-taper design, our method helps preserve a balanced level of leakage and variance, providing more robustness.
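To make the idea of a multi-taper estimate concrete, here is a minimal sketch that averages the power spectra obtained with several orthonormal tapers and compares the result with a single Hamming-windowed DFT. It assumes sine tapers with uniform weights, which is one common static design and not necessarily the one used in the paper. The function names, frame length, and taper count are illustrative, and the learned, jointly optimized tapers are not reproduced here.

```python
import numpy as np

def sine_tapers(frame_len, num_tapers):
    """Orthonormal sine tapers (an assumed static design), shape (K, N)."""
    n = np.arange(1, frame_len + 1)
    k = np.arange(1, num_tapers + 1)[:, None]
    return np.sqrt(2.0 / (frame_len + 1)) * np.sin(np.pi * k * n / (frame_len + 1))

def multitaper_power_spectrum(frame, tapers, weights=None):
    """Weighted average of the per-taper power spectra of one frame."""
    num_tapers = tapers.shape[0]
    if weights is None:
        # Uniform weights; other designs weight each taper differently.
        weights = np.full(num_tapers, 1.0 / num_tapers)
    spectra = np.abs(np.fft.rfft(tapers * frame, axis=-1)) ** 2
    return weights @ spectra

def single_taper_power_spectrum(frame):
    """Baseline: power spectrum from one unit-energy Hamming window."""
    window = np.hamming(len(frame))
    window /= np.linalg.norm(window)
    return np.abs(np.fft.rfft(window * frame)) ** 2

# White-noise frame: its true spectrum is flat, so bin-to-bin scatter
# reflects the variance of each estimator.
rng = np.random.default_rng(0)
frame = rng.standard_normal(400)
tapers = sine_tapers(400, 6)
mt = multitaper_power_spectrum(frame, tapers)
st = single_taper_power_spectrum(frame)
```

On this flat-spectrum input, the multi-taper estimate fluctuates less from bin to bin than the single-window estimate, which is the low-variance property described above. Using more tapers lowers variance further at the cost of coarser frequency resolution.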

