SD · AI · Dec 15, 2025

Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification

arXiv:2512.22148v1 · 1 citation
Originality: Incremental advance
AI Analysis

This work addresses speaker verification for speech processing applications, presenting an incremental improvement in feature aggregation methods.

The paper tackles the problem of aggregating multi-layer representations from pre-trained speech models for speaker verification by proposing Layer Attentive Pooling (LAP) and a lightweight backend model, achieving state-of-the-art performance on the VoxCeleb benchmark with reduced training time.

Recent speaker verification studies have achieved notable success by leveraging layer-wise outputs from pre-trained Transformer models. However, few have explored ways of aggregating these multi-level features that go beyond a static weighted average. We present Layer Attentive Pooling (LAP), a novel strategy for aggregating inter-layer representations from pre-trained speech models for speaker verification. LAP assesses the significance of each layer from multiple perspectives, dynamically over time, and employs max pooling instead of averaging. Additionally, we propose a lightweight backend speaker model comprising LAP and Attentive Statistical Temporal Pooling (ASTP) to extract speaker embeddings from pre-trained model output. Experiments on the VoxCeleb benchmark reveal that our compact architecture achieves state-of-the-art performance while greatly reducing the training time. We further analyze the design of LAP and its dynamic weighting mechanism for capturing speaker characteristics.
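
The contrast drawn in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how LAP scores layers, so the per-head linear scoring (`score_proj`), the head count (`num_heads`), and the attention parameters (`attn_w`, `attn_v`) are hypothetical. The sketch assumes that "multiple perspectives" means several scoring heads, that the weights are a softmax over layers computed separately for each frame, and that the temporal pooling step follows the common attentive-statistics formulation (attention-weighted mean and standard deviation over time).

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def static_weighted_average(hidden, layer_logits):
    """Baseline: one scalar weight per layer, shared by every frame.

    hidden: (L, T, D) layer outputs; layer_logits: (L,).
    """
    weights = softmax(layer_logits, axis=0)
    return np.einsum("l,ltd->td", weights, hidden)


def layer_attentive_pooling_sketch(hidden, score_proj, num_heads):
    """Assumed LAP-style aggregation (illustrative, not the paper's code).

    hidden: (L, T, D); score_proj: (D, num_heads), a hypothetical scorer.
    Each head scores every layer at every frame, the scores are normalised
    over layers, and the weighted layers are max-pooled instead of summed.
    """
    num_layers, num_frames, dim = hidden.shape
    if dim % num_heads != 0:
        raise ValueError("feature dim must be divisible by num_heads")
    scores = hidden @ score_proj                 # (L, T, heads)
    weights = softmax(scores, axis=0)            # normalise over layers
    chunks = hidden.reshape(num_layers, num_frames, num_heads, dim // num_heads)
    weighted = chunks * weights[..., None]       # per-frame, per-head weights
    return weighted.max(axis=0).reshape(num_frames, dim)


def attentive_statistics_pooling(frames, attn_w, attn_v):
    """Common attentive-statistics pooling over time (assumed form of ASTP).

    frames: (T, D); attn_w: (D, A); attn_v: (A,). Returns a (2 * D,) vector.
    """
    scores = np.tanh(frames @ attn_w) @ attn_v   # (T,)
    alpha = softmax(scores, axis=0)              # attention over frames
    mean = alpha @ frames
    var = alpha @ (frames ** 2) - mean ** 2
    std = np.sqrt(np.clip(var, 1e-8, None))
    return np.concatenate([mean, std])


# Toy usage with random stand-ins for pre-trained layer outputs.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((13, 50, 8))        # 13 layers, 50 frames, dim 8
frames = layer_attentive_pooling_sketch(hidden, rng.standard_normal((8, 2)), 2)
embedding = attentive_statistics_pooling(
    frames, rng.standard_normal((8, 4)), rng.standard_normal(4)
)
print(frames.shape, embedding.shape)
```

In the static baseline the layer weights are fixed for the whole utterance, whereas in the sketch they depend on the content of each frame, which is one plausible reading of the dynamic weighting described above. In a trained system the projections would be learned jointly with the speaker loss rather than drawn at random.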
