AS CL SD | Jul 2, 2020

Data Augmenting Contrastive Learning of Speech Representations in the Time Domain

Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, Emmanuel Dupoux

arXiv:2007.00991v1

Originality: Incremental advance

AI Analysis

This work improves unsupervised speech representation learning, which is relevant to both researchers and practitioners. The advance is incremental, since it builds on existing CPC methods.

The authors address the underperformance of Contrastive Predictive Coding (CPC) in speech representation learning by introducing WavAugment, a time-domain data augmentation library. A combination of pitch modification, additive noise and reverberation improves CPC performance by 18-22% relative and beats the reference Libri-light results with 600 times less data. With an out-of-domain dataset, augmentation brings CPC on par with the state of the art on the Zero Speech Benchmark 2017.

Contrastive Predictive Coding (CPC), based on predicting future segments of speech from past segments, is emerging as a powerful algorithm for representation learning of speech signals. However, it still under-performs other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library, and find that applying augmentation in the past is generally more efficient and yields better performance than other methods. We find that a combination of pitch modification, additive noise and reverberation substantially increases the performance of CPC (relative improvement of 18-22%), beating the reference Libri-light results with 600 times less data. Using an out-of-domain dataset, time-domain data augmentation can push CPC to be on par with the state of the art on the Zero Speech Benchmark 2017. We also show that time-domain data augmentation consistently improves downstream limited-supervision phoneme classification tasks by 12-15% relative.
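
To make "augmentation in the past" concrete, here is a minimal NumPy sketch of two of the three augmentations named in the abstract (additive noise and reverberation), applied only to the past part of a waveform. It is an illustration under stated assumptions, not WavAugment's API or the authors' implementation. The function names, the synthetic impulse response, the white-noise source, and the single split point are all hypothetical, and pitch modification is omitted because it needs a resampling or phase-vocoder step beyond a short example.

```python
import numpy as np


def add_noise_at_snr(signal, noise, snr_db):
    """Mix `noise` into `signal` at the requested signal-to-noise ratio (dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so signal_power / scaled_noise_power == 10^(snr_db / 10).
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise


def add_reverb(signal, sample_rate, decay_seconds, rng):
    """Convolve with a synthetic impulse response (assumption: decaying noise)."""
    n = int(sample_rate * decay_seconds)
    t = np.arange(n) / sample_rate
    # The envelope falls by about 60 dB in amplitude over `decay_seconds`.
    impulse = rng.standard_normal(n) * np.exp(-6.9 * t / decay_seconds)
    impulse[0] = 1.0  # direct (unreflected) path
    wet = np.convolve(signal, impulse)[: len(signal)]
    # Rescale so the reverberated segment keeps the original peak amplitude.
    return wet * (np.max(np.abs(signal)) / np.max(np.abs(wet)))


def augment_past_only(waveform, split, sample_rate, snr_db, rng):
    """Augment samples before `split`; leave the future samples untouched."""
    past, future = waveform[:split], waveform[split:]
    past = add_reverb(past, sample_rate, decay_seconds=0.3, rng=rng)
    noise = rng.standard_normal(len(past))  # hypothetical white-noise source
    past = add_noise_at_snr(past, noise, snr_db)
    return np.concatenate([past, future])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample_rate = 16000
    # One second of a 220 Hz tone standing in for a speech clip.
    clip = np.sin(2 * np.pi * 220 * np.arange(sample_rate) / sample_rate)
    augmented = augment_past_only(clip, split=8000, sample_rate=sample_rate,
                                  snr_db=10.0, rng=rng)
    print(augmented.shape)
```

In a CPC setup the model would encode the augmented past and be trained to predict representations of the clean future, so the split above stands in for the boundary between context and prediction targets.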
