SDAIAS | Jan 30, 2024

ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models

arXiv:2401.17230v2 | 50 citations | h-index: 27 | Has Code | INTERSPEECH
Originality: Synthesis-oriented
AI Analysis

This toolkit addresses the need for accessible, reproducible speaker recognition tools for researchers. The contribution is incremental, since it builds on existing methods.

The paper introduces ESPnet-SPK, an open-source toolkit for training speaker embedding extractors, with models ranging from x-vector to SKA-TDNN. Its released recipe achieves a 0.39% equal error rate on the Vox1-O evaluation protocol using WavLM-Large with ECAPA-TDNN.

This paper introduces ESPnet-SPK, a toolkit designed with several objectives for training speaker embedding extractors. First, we provide an open-source platform for researchers in the speaker recognition community to effortlessly build models. We provide several models, ranging from x-vector to the recent SKA-TDNN. Through the modularized architecture design, variants can be developed easily. We also aspire to bridge developed models with other domains, so that the broader research community can readily incorporate state-of-the-art embedding extractors. Pre-trained embedding extractors can be accessed in an off-the-shelf manner, and we demonstrate the toolkit's versatility by showcasing its integration with two tasks. Another goal is to integrate with diverse self-supervised learning features. We release a reproducible recipe that achieves an equal error rate of 0.39% on the Vox1-O evaluation protocol using WavLM-Large with ECAPA-TDNN.
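
For readers unfamiliar with the headline metric, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate over a list of verification trials. The sketch below is a minimal illustration in NumPy, not the toolkit's own scoring code. The function name `equal_error_rate` and the toy trial scores are hypothetical, and tied scores are not given any special handling.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate EER for verification trials (hypothetical helper).

    scores: similarity score per trial (higher means "same speaker").
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    # Sort trials from highest to lowest score.
    order = np.argsort(-scores, kind="mergesort")
    labels = labels[order]

    n_target = labels.sum()
    n_nontarget = len(labels) - n_target

    # Threshold k means "accept the top-k trials", for k = 0..N.
    false_accepts = np.concatenate(([0], np.cumsum(1 - labels)))
    false_rejects = n_target - np.concatenate(([0], np.cumsum(labels)))

    far = false_accepts / n_nontarget  # false acceptance rate
    frr = false_rejects / n_target     # false rejection rate

    # Pick the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2


# Toy trials (made up for illustration): one non-target outscores a target.
toy_scores = [0.9, 0.6, 0.7, 0.1]
toy_labels = [1, 1, 0, 0]
print(f"EER = {100 * equal_error_rate(toy_scores, toy_labels):.2f}%")
```

On a real protocol such as Vox1-O, the scores would typically be cosine similarities between the embeddings of the two utterances in each trial. That scoring choice is an assumption here, not something stated in the abstract.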
