AS CL LG · Nov 16, 2024

Bilingual Text-dependent Speaker Verification with Pre-trained Models for TdSV Challenge 2024

arXiv:2411.10828v1 · h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work addresses bilingual text-dependent speaker verification, showing incremental improvements through the use of pre-trained models.

The paper tackled the TdSV Challenge 2024 by developing bilingual text-dependent speaker verification systems using pre-trained models, achieving a MinDCF of 0.0358 and winning the challenge.

This paper presents our submissions to the Iranian division of the Text-dependent Speaker Verification (TdSV) Challenge 2024. TdSV aims to determine whether a specific phrase was spoken by a target speaker. We developed two independent subsystems based on pre-trained models: for phrase verification, a phrase classifier rejected incorrect phrases, while for speaker verification, a pre-trained ResNet293 with domain adaptation extracted speaker embeddings for computing cosine similarity scores. In addition, we evaluated Whisper-PMFA, a pre-trained ASR model adapted for speaker verification, and found that, although it outperforms randomly initialized ResNets, it falls short of the performance of pre-trained ResNets, highlighting the importance of large-scale pre-training. The results also demonstrate that competitive performance on TdSV is achievable without joint modeling of speaker and text. Our best system achieved a MinDCF of 0.0358 on the evaluation subset and won the challenge.
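The two-subsystem design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the averaging of enrollment embeddings, and the fixed rejection score of -1.0 are all assumptions made for illustration. The abstract states only that a phrase classifier rejects incorrect phrases and that speaker embeddings are compared by cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_embedding(embeddings):
    """Element-wise mean of several embeddings (assumed enrollment strategy)."""
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]


def tdsv_score(enroll_embeddings, test_embedding,
               predicted_phrase, claimed_phrase, reject_score=-1.0):
    """Score one trial with two independent subsystems.

    The phrase subsystem acts as a gate: if the classifier's predicted
    phrase differs from the claimed phrase, the trial receives a fixed
    low score (hypothetical choice: -1.0, the minimum cosine value).
    Otherwise the speaker subsystem's cosine score is returned.
    """
    if predicted_phrase != claimed_phrase:
        return reject_score
    enrolled = mean_embedding(enroll_embeddings)
    return cosine_similarity(enrolled, test_embedding)


# Toy 2-dimensional embeddings; real speaker embeddings are much larger.
enroll = [[1.0, 0.0], [1.0, 0.0]]
same_speaker = tdsv_score(enroll, [2.0, 0.0], predicted_phrase=3, claimed_phrase=3)
wrong_phrase = tdsv_score(enroll, [2.0, 0.0], predicted_phrase=4, claimed_phrase=3)
```

In this sketch the phrase decision and the speaker score never share parameters, which mirrors the abstract's point that joint modeling of speaker and text is not required for competitive performance.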

