SD | LG | AS | Feb 20, 2021

Learnable MFCCs for Speaker Verification

arXiv:2102.10322v1 | 19 citations
Originality: Incremental advance
AI Analysis

This work addresses speaker verification for security and biometric applications. It offers an incremental enhancement: traditional MFCC features are adapted to the data instead of being kept fixed.

The paper proposes a learnable MFCC frontend architecture for speaker verification and reports relative equal error rate (EER) improvements of up to 6.7% on VoxCeleb1 and 9.7% on SITW over static MFCCs.
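For readers unfamiliar with the metric, a relative improvement in EER is the reduction in EER divided by the baseline EER. A minimal sketch follows; the function name and the example EER values are illustrative assumptions and are not taken from the paper.

```python
def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction in percent (lower EER is better)."""
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Hypothetical EERs in percent, chosen only to illustrate the arithmetic.
baseline, learnable = 3.0, 2.8
print(f"{relative_improvement(baseline, learnable):.2f}% relative improvement")
```

Note that a relative figure says nothing about the absolute EER: the same relative gain can correspond to very different absolute gaps depending on the baseline.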

We propose a learnable mel-frequency cepstral coefficient (MFCC) frontend architecture for deep neural network (DNN) based automatic speaker verification. Our architecture retains the simplicity and interpretability of MFCC-based features while allowing the model to be adapted to data flexibly. In practice, we formulate data-driven versions of the four linear transforms of a standard MFCC extractor: windowing, discrete Fourier transform (DFT), mel filterbank and discrete cosine transform (DCT). The reported results reach up to 6.7% (VoxCeleb1) and 9.7% (SITW) relative improvement in terms of equal error rate (EER) over static MFCCs, without additional tuning effort.
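To make the four linear transforms concrete, here is a minimal NumPy sketch of a static MFCC extractor in which each transform is an explicit array. It is an illustration under stated assumptions, not the authors' implementation: the function names, the parameter dictionary, and the settings (400-sample frames, 40 mel filters, 20 coefficients, 16 kHz) are all hypothetical. In a learnable variant, these arrays would presumably serve as initial values for trainable parameters; how the paper parameterizes or constrains them is not stated in the abstract.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def dft_matrix(n_fft):
    """One-sided DFT matrix, shape (n_fft // 2 + 1, n_fft), complex."""
    k = np.arange(n_fft // 2 + 1)[:, None]
    n = np.arange(n_fft)[None, :]
    return np.exp(-2j * np.pi * k * n / n_fft)


def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def dct_matrix(n_ceps, n_mels):
    """First n_ceps rows of the orthonormal DCT-II basis."""
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_mels)[None, :]
    mat = np.sqrt(2.0 / n_mels) * np.cos(np.pi * k * (n + 0.5) / n_mels)
    mat[0] /= np.sqrt(2.0)
    return mat


def init_params(frame_len=400, n_mels=40, n_ceps=20, sr=16000):
    """Static values; a learnable frontend could start training from these."""
    return {
        "window": np.hamming(frame_len),
        "dft": dft_matrix(frame_len),
        "mel": mel_filterbank(n_mels, frame_len, sr),
        "dct": dct_matrix(n_ceps, n_mels),
    }


def mfcc(frames, p, eps=1e-10):
    """frames: (n_frames, frame_len) -> MFCCs: (n_frames, n_ceps)."""
    windowed = frames * p["window"]          # 1. windowing (elementwise scaling)
    spectrum = windowed @ p["dft"].T         # 2. DFT
    power = np.abs(spectrum) ** 2            # nonlinearity between linear stages
    mel_energy = power @ p["mel"].T          # 3. mel filterbank
    return np.log(mel_energy + eps) @ p["dct"].T  # 4. DCT of log energies


# Usage: five random frames stand in for real speech frames.
frames = np.random.default_rng(0).standard_normal((5, 400))
features = mfcc(frames, init_params())
print(features.shape)
```

Writing every stage as a matrix or vector product is what makes the "data-driven versions" idea straightforward to picture: only the magnitude-squared and logarithm steps are nonlinear, so the remaining stages are ordinary linear layers whose weights happen to be fixed in a static extractor.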

