SD AI AS | Jan 30, 2024

ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models

Jee-weon Jung, Wangyou Zhang, Jiatong Shi, Zakaria Aldeneh, Takuya Higuchi, Barry-John Theobald, Ahmed Hussen Abdelaziz, Shinji Watanabe

arXiv:2401.17230v221.854 citations | h-index: 27 | Has Code | INTERSPEECH

Originality: Synthesis-oriented

AI Analysis

This toolkit addresses researchers' need for accessible, reproducible speaker recognition tools. Its contribution is incremental, since it builds on existing methods.

The paper introduces ESPnet-SPK, a toolkit for training speaker embedding extractors, providing an open-source platform with models like x-vector and SKA-TDNN, and demonstrates its versatility by achieving a 0.39% equal error rate on the Vox1-O evaluation protocol using WavLM-Large with ECAPA-TDNN.

This paper introduces ESPnet-SPK, a toolkit designed with several objectives for training speaker embedding extractors. First, we provide an open-source platform for researchers in the speaker recognition community to effortlessly build models. We provide several models, ranging from x-vector to the recent SKA-TDNN. Thanks to the modularized architecture design, variants can be developed easily. We also aspire to bridge developed models with other domains, enabling the broad research community to effortlessly incorporate state-of-the-art embedding extractors. Pre-trained embedding extractors can be accessed in an off-the-shelf manner, and we demonstrate the toolkit's versatility by showcasing its integration with two tasks. Another goal is to integrate with diverse self-supervised learning features. We release a reproducible recipe that achieves an equal error rate of 0.39% on the Vox1-O evaluation protocol using WavLM-Large with ECAPA-TDNN.
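For readers unfamiliar with the headline metric, the equal error rate (EER) is the operating point at which the false acceptance rate (non-target trials scored at or above the threshold) equals the false rejection rate (target trials scored below it). The sketch below is a minimal illustration, not the toolkit's own scoring code: the function name `equal_error_rate` and the toy trial list are hypothetical, and the threshold sweep is a simple quadratic-time approximation that would be too slow for a full Vox1-O trial list.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER for verification trials.

    scores: similarity scores, higher means "more likely the same speaker".
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    Hypothetical helper for illustration; not part of ESPnet-SPK.
    """
    targets = [s for s, y in zip(scores, labels) if y == 1]
    nontargets = [s for s, y in zip(scores, labels) if y == 0]
    if not targets or not nontargets:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer = None, None
    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is also considered.
    for thr in sorted(set(scores)) + [max(scores) + 1.0]:
        far = sum(s >= thr for s in nontargets) / len(nontargets)  # false accepts
        frr = sum(s < thr for s in targets) / len(targets)         # false rejects
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Report the midpoint of the two rates where they are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy trial list (made up for illustration, not data from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(equal_error_rate(toy_scores, toy_labels))  # 0.25
```

On this scale, the reported 0.39% corresponds to a returned value of about 0.0039.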
