SD, AI, CL, LG, AS · Aug 29, 2025

Evaluating the Effectiveness of Transformer Layers in Wav2Vec 2.0, XLS-R, and Whisper for Speaker Identification Tasks

arXiv:2509.00230v2 · h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work addresses speaker identification for speech processing applications, but it is incremental as it applies existing models to a specific task without introducing new methods.

This study evaluated Wav2Vec 2.0, XLS-R, and Whisper for speaker identification, finding that Wav2Vec 2.0 and XLS-R capture speaker features in early layers with improved stability after fine-tuning, while Whisper performed better in deeper layers, and determined optimal transformer layer counts for each model.

This study evaluates the performance of three advanced speech encoder models, Wav2Vec 2.0, XLS-R, and Whisper, in speaker identification tasks. By fine-tuning these models and analyzing their layer-wise representations using SVCCA, k-means clustering, and t-SNE visualizations, we found that Wav2Vec 2.0 and XLS-R capture speaker-specific features effectively in their early layers, with fine-tuning improving stability and performance. Whisper showed better performance in deeper layers. Additionally, we determined the optimal number of transformer layers for each model when fine-tuned for speaker identification tasks.
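As an illustration of the layer-wise comparison mentioned above, the following is a minimal sketch of an SVCCA similarity score written in plain NumPy. It is not the authors' implementation: the function name `svcca_similarity`, the 99% variance threshold, and the synthetic matrices standing in for per-layer hidden states are all assumptions made for the example. In practice, each matrix would hold one row per frame or utterance and one column per hidden dimension of a given transformer layer.

```python
import numpy as np


def svcca_similarity(acts_a, acts_b, var_kept=0.99):
    """Mean SVCCA correlation between two (n_samples, n_features) matrices.

    Hypothetical helper for illustration; var_kept is an assumed threshold.
    """

    def reduce(acts):
        # SVD step: centre each feature, then keep the leading directions
        # that together explain `var_kept` of the variance.
        centred = acts - acts.mean(axis=0, keepdims=True)
        u, s, _ = np.linalg.svd(centred, full_matrices=False)
        energy = np.cumsum(s**2) / np.sum(s**2)
        k = min(int(np.searchsorted(energy, var_kept)) + 1, len(s))
        return u[:, :k] * s[:k]

    a, b = reduce(acts_a), reduce(acts_b)

    # CCA step: orthonormalise each reduced subspace; the singular values
    # of Qa^T Qb are the canonical correlations between the two subspaces.
    qa, _ = np.linalg.qr(a)
    qb, _ = np.linalg.qr(b)
    corrs = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return float(np.mean(np.clip(corrs, 0.0, 1.0)))


# Synthetic stand-ins for layer activations (assumption: 200 samples, 16 dims).
rng = np.random.default_rng(0)
layer_x = rng.normal(size=(200, 16))

# A rotated and shifted copy spans the same subspace as layer_x.
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
layer_y = layer_x @ rotation + 3.0

# An independent random matrix shares no structure with layer_x.
layer_z = rng.normal(size=(200, 16))

sim_rotated = svcca_similarity(layer_x, layer_y)
sim_random = svcca_similarity(layer_x, layer_z)
print(f"rotated copy: {sim_rotated:.3f}, independent: {sim_random:.3f}")
```

Because the score is invariant to rotations and offsets of the feature space, it can be used to compare the same layer before and after fine-tuning, or different layers of one model, even when individual hidden units do not line up.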

