Learnable MFCCs for Speaker Verification
This work addresses speaker verification for security and biometric applications, offering an incremental improvement by making traditional MFCC features adaptable to data.
The paper addresses speaker verification by proposing a learnable MFCC frontend architecture, which achieves relative improvements in equal error rate of up to 6.7% on VoxCeleb1 and 9.7% on SITW over static MFCCs.
We propose a learnable mel-frequency cepstral coefficient (MFCC) frontend architecture for deep neural network (DNN) based automatic speaker verification. Our architecture retains the simplicity and interpretability of MFCC-based features while allowing the model to be adapted flexibly to data. In practice, we formulate data-driven versions of the four linear transforms of a standard MFCC extractor: windowing, discrete Fourier transform (DFT), mel filterbank and discrete cosine transform (DCT). The reported results show relative improvements of up to 6.7\% (VoxCeleb1) and 9.7\% (SITW) in terms of equal error rate (EER) over static MFCCs, without additional tuning effort.
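To make the four linear transforms concrete, the following is a minimal NumPy sketch of a standard (static) MFCC extractor in which each transform is held as an explicit array. It is an illustration under stated assumptions, not the architecture proposed here: the function names, the parameter dictionary, the default sizes (512-sample frames, 40 mel filters, 20 coefficients, 16 kHz), and the choice of a Hann window with an orthonormal DCT-II are all hypothetical. In a learnable frontend, arrays like these would presumably serve as initial values for trainable parameters, possibly with constraints that this sketch does not model.

```python
import numpy as np

def hann_window(n_fft):
    # Periodic Hann window of length n_fft.
    return 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n_fft) / n_fft)

def dft_matrices(n_fft):
    # Real and imaginary parts of the one-sided DFT, each (n_fft//2 + 1, n_fft).
    k = np.arange(n_fft // 2 + 1)[:, None]
    n = np.arange(n_fft)[None, :]
    angle = 2 * np.pi * k * n / n_fft
    return np.cos(angle), -np.sin(angle)

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    # Triangular filters with centers equally spaced on the mel scale.
    freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2), n_mels + 2))
    fb = np.zeros((n_mels, freqs.size))
    for m in range(n_mels):
        lo, ctr, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (ctr - lo)
        falling = (hi - freqs) / (hi - ctr)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def dct_matrix(n_ceps, n_mels):
    # First n_ceps rows of the orthonormal DCT-II matrix.
    k = np.arange(n_ceps)[:, None]
    m = np.arange(n_mels)[None, :]
    mat = np.sqrt(2.0 / n_mels) * np.cos(np.pi * k * (m + 0.5) / n_mels)
    mat[0] /= np.sqrt(2.0)
    return mat

def init_params(n_fft=512, n_mels=40, n_ceps=20, sample_rate=16000):
    # Hypothetical container: static MFCC values that a learnable
    # frontend could use as its starting point (assumption).
    dft_real, dft_imag = dft_matrices(n_fft)
    return {
        "window": hann_window(n_fft),
        "dft_real": dft_real,
        "dft_imag": dft_imag,
        "mel": mel_filterbank(n_mels, n_fft, sample_rate),
        "dct": dct_matrix(n_ceps, n_mels),
    }

def mfcc(params, frames, eps=1e-10):
    # frames: (n_frames, n_fft) array of already-framed audio samples.
    x = frames * params["window"]                # 1. windowing
    re = x @ params["dft_real"].T                # 2. DFT (real part)
    im = x @ params["dft_imag"].T                #    DFT (imaginary part)
    power = re ** 2 + im ** 2                    #    power spectrum
    mel_energy = power @ params["mel"].T         # 3. mel filterbank
    return np.log(mel_energy + eps) @ params["dct"].T  # 4. log, then DCT

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.standard_normal((3, 512))       # three synthetic frames
    feats = mfcc(init_params(), frames)
    print(feats.shape)                           # (3, 20)
```

Writing the DFT as two real matrices rather than calling an FFT routine is deliberate in this sketch: it exposes the transform as ordinary weights, which is what would allow it to be updated by gradient descent alongside the window, filterbank, and DCT.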