A comparison of self-supervised speech representations as input features for unsupervised acoustic word embeddings
This work advances zero-resource speech processing, particularly tasks that rely on acoustic word embeddings, by showing that self-supervised frame-level features outperform conventional MFCCs as inputs to an unsupervised AWE model.
This paper investigates the use of self-supervised speech representations as input features for unsupervised acoustic word embeddings (AWEs). The authors find that frame-level features from contrastive predictive coding (CPC), autoregressive predictive coding, and a correspondence autoencoder all outperform conventional MFCCs in a word discrimination task on English and Xitsonga data, with CPC consistently showing the biggest improvement.
Many speech processing tasks involve measuring the acoustic similarity between speech segments. Acoustic word embeddings (AWE) allow for efficient comparisons by mapping speech segments of arbitrary duration to fixed-dimensional vectors. For zero-resource speech processing, where unlabelled speech is the only available resource, some of the best AWE approaches rely on weak top-down constraints in the form of automatically discovered word-like segments. Rather than learning embeddings at the segment level, another line of zero-resource research has looked at representation learning at the short-time frame level. Recent approaches include self-supervised predictive coding and correspondence autoencoder (CAE) models. In this paper we consider whether these frame-level features are beneficial when used as inputs for training an unsupervised AWE model. We compare frame-level features from contrastive predictive coding (CPC), autoregressive predictive coding and a CAE to conventional MFCCs. These are used as inputs to a recurrent CAE-based AWE model. In a word discrimination task on English and Xitsonga data, all three representation learning approaches outperform MFCCs, with CPC consistently showing the biggest improvement. In cross-lingual experiments we find that CPC features trained on English can also be transferred to Xitsonga.
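To make the idea of mapping segments of arbitrary duration to fixed-dimensional vectors concrete, the following minimal Python sketch embeds two segments of different lengths and compares them with cosine distance. It is an illustration only: the chunk-averaging function `embed_segment` is a hypothetical stand-in for a learned embedding model, not the recurrent CAE-based AWE model used in the paper, and the random arrays merely stand in for frame-level features such as MFCCs or CPC outputs.

```python
import numpy as np


def embed_segment(frames: np.ndarray, n_chunks: int = 4) -> np.ndarray:
    """Map a (T, D) frame sequence to a fixed (n_chunks * D,) vector.

    Assumption: simple chunk averaging, used here only as a stand-in
    for a trained acoustic word embedding model.
    """
    if frames.ndim != 2 or frames.shape[0] < n_chunks:
        raise ValueError("expected a (T, D) array with T >= n_chunks")
    # Split the time axis into n_chunks roughly equal parts and average each.
    chunks = np.array_split(frames, n_chunks, axis=0)
    return np.concatenate([chunk.mean(axis=0) for chunk in chunks])


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance between two embedding vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Placeholder data: two segments with different durations (50 and 80 frames)
# and 13-dimensional frame features.
rng = np.random.default_rng(0)
seg_a = rng.normal(size=(50, 13))
seg_b = rng.normal(size=(80, 13))

emb_a = embed_segment(seg_a)
emb_b = embed_segment(seg_b)
dist_ab = cosine_distance(emb_a, emb_b)
```

Both embeddings have the same dimensionality regardless of segment length, so comparing two segments reduces to a single vector distance rather than an alignment over frames.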