SD AI · Dec 15, 2025

Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification

Jin Sob Kim, Hyun Joon Park, Wooseok Shin, Sung Won Han

arXiv:2512.22148v1 · 7.01 citations

Originality: Incremental advance

AI Analysis

This work addresses speaker verification for speech processing applications, presenting an incremental improvement in feature aggregation methods.

The paper addresses how to aggregate multi-layer representations from pre-trained speech models for speaker verification. It proposes Layer Attentive Pooling (LAP) together with a lightweight backend model, and reports state-of-the-art performance on the VoxCeleb benchmark with reduced training time.

Recent speaker verification studies have achieved notable success by leveraging layer-wise output from pre-trained Transformer models. However, few have explored the advancements in aggregating these multi-level features beyond the static weighted average. We present Layer Attentive Pooling (LAP), a novel strategy for aggregating inter-layer representations from pre-trained speech models for speaker verification. LAP assesses the significance of each layer from multiple perspectives time-dynamically, and employs max pooling instead of averaging. Additionally, we propose a lightweight backend speaker model comprising LAP and Attentive Statistical Temporal Pooling (ASTP) to extract speaker embeddings from pre-trained model output. Experiments on the VoxCeleb benchmark reveal that our compact architecture achieves state-of-the-art performance while greatly reducing the training time. We further analyzed LAP design and its dynamic weighting mechanism for capturing speaker characteristics.
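The contrast the abstract draws between a static weighted average and time-dynamic weighting with max pooling can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the single scoring vector `query`, and the dot-product scoring are assumptions made for illustration, and the multi-perspective scoring that LAP uses is not reproduced here. Layer outputs are assumed to be indexed as `hidden[layer][frame][dim]`.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def static_weighted_average(hidden, layer_logits):
    """Baseline: one scalar weight per layer, shared by every frame."""
    w = softmax(layer_logits)
    n_layers, n_frames, dim = len(hidden), len(hidden[0]), len(hidden[0][0])
    return [
        [sum(w[l] * hidden[l][t][d] for l in range(n_layers)) for d in range(dim)]
        for t in range(n_frames)
    ]


def dynamic_max_aggregate(hidden, query):
    """Illustrative time-dynamic variant (assumption, single scoring vector).

    For each frame, every layer is scored by a dot product with `query`,
    the scores are normalized across layers, and the weighted layer
    outputs are combined by an element-wise max instead of a sum.
    """
    n_layers, n_frames, dim = len(hidden), len(hidden[0]), len(hidden[0][0])
    out = []
    for t in range(n_frames):
        scores = [
            sum(q * x for q, x in zip(query, hidden[l][t])) for l in range(n_layers)
        ]
        w = softmax(scores)  # layer weights for this frame only
        out.append(
            [max(w[l] * hidden[l][t][d] for l in range(n_layers)) for d in range(dim)]
        )
    return out


# Toy input: 2 layers, 2 frames, 2 dimensions.
hidden = [
    [[1.0, 0.0], [0.0, 2.0]],  # layer 0
    [[0.0, 1.0], [2.0, 0.0]],  # layer 1
]
static_out = static_weighted_average(hidden, [0.0, 0.0])
dynamic_out = dynamic_max_aggregate(hidden, [1.0, 0.0])
```

In this toy input the static baseline applies the same weight to both layers at every frame, while the dynamic variant favors layer 0 at the first frame and layer 1 at the second, because the layer weights are recomputed per frame.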
