SD AI LG AS · Aug 22, 2023

An Effective Transformer-based Contextual Model and Temporal Gate Pooling for Speaker Identification

arXiv:2308.11241v2 · 2 citations · h-index: 1 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses speaker identification for speech processing applications, presenting an incremental improvement in efficiency and performance.

The paper tackles speaker identification by proposing an effective Transformer-based contextual model and a Temporal Gate Pooling method, achieving 87.1% accuracy on VoxCeleb1 with only 28.5M parameters, comparable to wav2vec2 with 317.7M parameters.

Wav2vec2 has successfully applied the Transformer architecture and self-supervised learning to speech recognition. Recently, these techniques have come to be used not only for speech recognition but for speech processing as a whole. This paper introduces an effective end-to-end speaker identification model that applies a Transformer-based contextual model. We explored the relationship between hyper-parameters and performance in order to discern the structure of an effective model. Furthermore, we propose a pooling method, Temporal Gate Pooling, with powerful learning ability for speaker identification. We applied Conformer as the encoder and BEST-RQ for pre-training, and evaluated the model on the VoxCeleb1 speaker identification task. The proposed method achieved an accuracy of 87.1% with 28.5M parameters, demonstrating accuracy comparable to wav2vec2 with 317.7M parameters. Code is available at https://github.com/HarunoriKawano/speaker-identification-with-tgp.
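The abstract names Temporal Gate Pooling but does not define it here. As a rough illustration of the general idea of gating frames before pooling over time, the sketch below computes a sigmoid gate per frame and returns the gate-weighted average of the frame features. This is an assumption made for illustration only, not the paper's formulation: the function name `gated_temporal_pool` and the parameters `gate_weights` and `gate_bias` are hypothetical, and the authors' actual method should be taken from the linked repository.

```python
import math
from typing import List, Sequence


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_temporal_pool(
    frames: Sequence[Sequence[float]],
    gate_weights: Sequence[float],
    gate_bias: float = 0.0,
) -> List[float]:
    """Pool a (T x D) sequence of frame features into one D-dim vector.

    Hypothetical sketch: each frame gets a scalar gate
    sigmoid(w . x_t + b), and the output is the gate-weighted
    average of the frames. Not the paper's definition.
    """
    if not frames:
        raise ValueError("frames must contain at least one frame")
    dim = len(gate_weights)
    gates = []
    for frame in frames:
        if len(frame) != dim:
            raise ValueError("every frame must have len(gate_weights) features")
        score = sum(w * x for w, x in zip(gate_weights, frame)) + gate_bias
        gates.append(_sigmoid(score))

    total = sum(gates)
    if total == 0.0:
        # All gates underflowed to zero: fall back to plain mean pooling.
        gates = [1.0] * len(frames)
        total = float(len(frames))

    return [
        sum(g * frame[d] for g, frame in zip(gates, frames)) / total
        for d in range(dim)
    ]
```

With all-zero gate weights every frame receives the same gate, so the sketch reduces to ordinary mean pooling over time; learned weights would let some frames contribute more than others to the utterance-level vector.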

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
