SD AS | Jul 13, 2021

Speech Representation Learning Combining Conformer CPC with Deep Cluster for the ZeroSpeech Challenge 2021

Takashi Maekaku, Xuankai Chang, Yuya Fujita, Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky

arXiv:2107.05899v211.713 citations | Has Code

Originality: Incremental advance

AI Analysis

This work addresses learning speech representations without labeled data, a core problem in zero-resource speech processing. The advance is incremental: it builds on existing methods, namely CPC and deep clustering.

The paper tackles unsupervised speech representation learning by combining Contrastive Predictive Coding with deep clustering, achieving a 35% relative improvement in phonetic metrics and top results in syntactic metrics for the ZeroSpeech Challenge 2021.

We present a system for the Zero Resource Speech Challenge 2021 that combines Contrastive Predictive Coding (CPC) with deep cluster. In deep cluster, we first prepare pseudo-labels by clustering the outputs of a CPC network with k-means. Then, we train an additional autoregressive model to classify the previously obtained pseudo-labels in a supervised manner. A phoneme-discriminative representation is obtained by running a second round of clustering on the outputs of the final layer of the autoregressive model. We show that replacing a Transformer layer with a Conformer layer leads to a further gain in a lexical metric. Experimental results show relative improvements of 35% in a phonetic metric, 1.5% in the lexical metric, and 2.3% in a syntactic metric over a CPC-small baseline trained on LibriSpeech 460h data. We achieve the top result in this challenge on the syntactic metric.
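The three deep cluster steps described in the abstract (k-means pseudo-labels, supervised training on those labels, second-round clustering) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. Everything concrete here is an assumption made for illustration: random vectors stand in for CPC outputs, a one-hidden-layer MLP stands in for the autoregressive model, its hidden activations stand in for the final-layer outputs, and the names `kmeans`, `train_classifier`, and all sizes are hypothetical.

```python
import numpy as np


def kmeans(x, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points (fancy indexing copies).
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Final assignment against the last centroid update.
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dists.argmin(axis=1)


def train_classifier(x, y, n_classes, hidden=16, lr=0.5, epochs=200, seed=0):
    """Train a tiny MLP on pseudo-labels; return a function giving hidden features.

    Assumption: this MLP is only a stand-in for the autoregressive model.
    """
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0, 0.1, (x.shape[1], hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0, 0.1, (hidden, n_classes))
    b2 = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        h = np.tanh(x @ w1 + b1)
        logits = h @ w2 + b2
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Gradients of the mean cross-entropy loss.
        g_logits = (probs - onehot) / len(x)
        g_h = (g_logits @ w2.T) * (1.0 - h ** 2)
        w2 -= lr * (h.T @ g_logits)
        b2 -= lr * g_logits.sum(axis=0)
        w1 -= lr * (x.T @ g_h)
        b1 -= lr * g_h.sum(axis=0)
    return lambda inputs: np.tanh(inputs @ w1 + b1)


rng = np.random.default_rng(0)
cpc_features = rng.normal(size=(300, 8))  # stand-in for CPC network outputs

# Step 1: pseudo-labels from k-means over the (stand-in) CPC features.
_, pseudo_labels = kmeans(cpc_features, k=5)

# Step 2: supervised training to classify the pseudo-labels.
encode = train_classifier(cpc_features, pseudo_labels, n_classes=5)

# Step 3: second-round clustering on the trained model's representations.
hidden_feats = encode(cpc_features)
_, units = kmeans(hidden_feats, k=5)
```

In this sketch `units` plays the role of the discrete units produced by the second clustering round. A real system would operate on frame-level features from a trained CPC model and would use far more clusters and data than the toy values chosen here.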
