Text-Dependent Speaker Verification (TdSV) Challenge 2024: Team Naive System Report
This work is an incremental improvement on the specific task of text-dependent speaker verification in a challenge setting.
The team achieved a MinDCF of 0.0461 and an EER of 1.3% in the 2024 Text-Dependent Speaker Verification Challenge by adapting existing neural networks (ResNet-TDNN and NeXt-TDNN), designing a lightweight EfficientNet-A0 model, and combining them through ensemble learning.
This paper presents our system for the 2024 Text-Dependent Speaker Verification (TdSV) Challenge, which achieved a Minimum Detection Cost Function (MinDCF) of 0.0461 and an Equal Error Rate (EER) of 1.3\%. Our approach adapts two existing state-of-the-art neural networks, ResNet-TDNN and NeXt-TDNN, originally trained on the VoxCeleb dataset; we chose this strategy because of the limited challenge duration and the resources available at the time. In addition, we designed a lightweight, resource-efficient model, EfficientNet-A0, and trained it specifically on the challenge dataset to improve adaptation and strengthen the ensemble. The system combines these architectures with extensive data augmentation and optimised hyperparameters, which together yielded strong performance in text-dependent speaker verification. The results also demonstrate the effectiveness of multi-model ensemble learning for both speaker and phrase verification.
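For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how EER and a normalised MinDCF can be computed from trial scores, together with a simple score-level fusion. It is an illustration under stated assumptions, not the scoring code used in the challenge: the cost parameters (`p_target=0.01`, `c_miss=10`, `c_fa=1`) are assumed defaults, the function names are hypothetical, and equal-weight score averaging is only one possible fusion rule, which this report does not claim to have used.

```python
import numpy as np


def error_rates(target_scores, nontarget_scores):
    """Sweep a decision threshold over all observed scores.

    Returns (p_miss, p_fa): miss and false-alarm rates at each threshold,
    where a trial is accepted when its score is >= the threshold.
    """
    tar = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    # Every observed score is a candidate threshold; +inf rejects all trials.
    thresholds = np.append(np.sort(np.concatenate([tar, non])), np.inf)
    p_miss = np.array([np.mean(tar < t) for t in thresholds])
    p_fa = np.array([np.mean(non >= t) for t in thresholds])
    return p_miss, p_fa


def eer(target_scores, nontarget_scores):
    """Approximate EER: the point where p_miss and p_fa are closest."""
    p_miss, p_fa = error_rates(target_scores, nontarget_scores)
    i = np.argmin(np.abs(p_miss - p_fa))
    return float((p_miss[i] + p_fa[i]) / 2.0)


def min_dcf(target_scores, nontarget_scores,
            p_target=0.01, c_miss=10.0, c_fa=1.0):
    """Normalised minimum detection cost.

    The default cost parameters are assumptions for illustration only.
    """
    p_miss, p_fa = error_rates(target_scores, nontarget_scores)
    dcf = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
    # Normalise by the cost of the best trivial system
    # (accept everything or reject everything).
    norm = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return float(dcf.min() / norm)


def fuse_scores(per_model_scores):
    """Equal-weight score averaging across models (an assumed fusion rule).

    per_model_scores has shape (n_models, n_trials).
    """
    return np.mean(np.asarray(per_model_scores, dtype=float), axis=0)
```

In such a setup, each model would score the same trial list, `fuse_scores` would combine the per-model scores, and the fused scores would be split into target and non-target trials before being passed to `eer` and `min_dcf`. In practice, scores from different models are usually calibrated or normalised before fusion; that step is omitted here.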