SD CL LG AS | Jan 10, 2025

Towards Early Prediction of Self-Supervised Speech Model Performance

Ryan Whetten, Lucas Maison, Titouan Parcollet, Marco Dinarelli, Yannick Estève

arXiv:2501.05966v214.35 citations | h-index: 19 | INTERSPEECH

Originality: Incremental advance

AI Analysis

This work addresses the resource-intensive evaluation of self-supervised learning (SSL) models for speech processing, offering a more efficient way to gauge model quality during pre-training.

The paper tackles the problem of predicting the downstream performance of self-supervised speech models during pre-training. It proposes unsupervised methods based on the cluster quality and rank of the embeddings, which correlate better with downstream performance than the pre-training loss and reduce the GPU hours and labeled data needed for evaluation.

In Self-Supervised Learning (SSL), pre-training and evaluation are resource-intensive. In the speech domain, current indicators of the quality of SSL models during pre-training, such as the loss, do not correlate well with downstream performance. Consequently, it is often difficult to gauge the final downstream performance in a cost-efficient manner during pre-training. In this work, we propose efficient unsupervised methods that give insights into the quality of the pre-training of SSL speech models, namely, measuring the cluster quality and rank of the embeddings of the SSL model. Results show that, with only one hour of unlabeled audio, measures of cluster quality and rank correlate better with downstream performance than the pre-training loss, reducing the need for GPU hours and labeled data in SSL model evaluation.
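
The abstract's central claim is a comparison of correlations: an unsupervised per-checkpoint measure versus downstream performance. As a rough illustration of how such a comparison could be scored, the sketch below computes a Spearman rank correlation in plain Python. The choice of Spearman correlation is an assumption (the abstract does not say which correlation statistic is used), and the function names and the numbers in the usage section are hypothetical placeholders, not results from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical placeholder values, one entry per pre-training checkpoint.
    # They are NOT taken from the paper.
    embedding_measure = [0.2, 0.5, 0.4, 0.9, 1.1]   # e.g. a rank or cluster score
    downstream_error = [30.0, 24.0, 26.0, 18.0, 15.0]  # lower is better
    print(spearman(embedding_measure, downstream_error))
```

A strongly negative value here would mean that a higher embedding measure goes with a lower downstream error. Running the same computation with the pre-training loss in place of the embedding measure gives the baseline the abstract compares against.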

