AS · LG · SD · Feb 17, 2023

Improving Transformer-based Networks With Locality For Automatic Speaker Verification

arXiv:2302.08639v2 · 17 citations · h-index: 56
Originality: Incremental advance
AI Analysis

This work addresses the need for better local context capture in speaker verification, offering incremental improvements over existing models.

The authors address the weak local-context modeling of Transformer-based speaker verification networks by enhancing locality modeling, achieving 0.75% EER on the VoxCeleb 1 test set and a 14.6% relative reduction in EER over Res2Net50 on a large-scale internal dataset.

Recently, Transformer-based architectures have been explored for speaker embedding extraction. Although the Transformer employs the self-attention mechanism to efficiently model the global interaction between token embeddings, it is inadequate for capturing short-range local context, which is essential for the accurate extraction of speaker information. In this study, we enhance the Transformer with improved locality modeling in two directions. First, we propose the Locality-Enhanced Conformer (LE-Conformer) by introducing depth-wise convolution and channel-wise attention into the Conformer blocks. Second, we present the Speaker Swin Transformer (SST) by adapting the Swin Transformer, originally proposed for vision tasks, into a speaker embedding network. We evaluate the proposed approaches on the VoxCeleb datasets and a large-scale Microsoft internal multilingual (MS-internal) dataset. The proposed models achieve 0.75% EER on the VoxCeleb 1 test set, outperforming the previously proposed Transformer-based models and CNN-based models, such as ResNet34 and ECAPA-TDNN. When trained on the MS-internal dataset, the proposed models achieve promising results with a 14.6% relative reduction in EER over the Res2Net50 model.
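
The results above are reported as equal error rate (EER) and as a relative reduction in EER. As a rough illustration of how those two figures are commonly computed, here is a minimal Python sketch. It is not the authors' evaluation code: the function names, the threshold sweep, and the toy scores are all assumptions made for this example, and the abstract does not give the absolute EERs behind the 14.6% figure.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping the threshold over all observed scores.

    target_scores: similarity scores of same-speaker trials.
    nontarget_scores: similarity scores of different-speaker trials.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction of a new system with respect to a baseline."""
    return (baseline_eer - new_eer) / baseline_eer


# Toy scores, invented for illustration only.
toy_eer = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])

# Hypothetical EERs (in %), chosen only so the ratio matches a 14.6% reduction.
toy_reduction = relative_reduction(1.0, 0.854)
```

Real evaluations typically interpolate between adjacent thresholds on a much larger trial list; the exhaustive sweep here is only meant to make the definition concrete.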

