AS AI LG SD

Oct 24, 2020

A Comparison of Discrete Latent Variable Models for Speech Representation Learning

Henry Zhou, Alexei Baevski, Michael Auli

arXiv:2010.14230v1

5.912 citations

Originality: Synthesis-oriented

AI Analysis

This work is aimed at researchers in speech representation learning. Its contribution is incremental, since it compares existing methods.

The paper compares vq-vae and vq-wav2vec for speech representation learning and finds that vq-wav2vec, which is based on future time-step prediction, performs better. The best system achieves an error rate of 13.22 on the ZeroSpeech 2019 ABX phoneme discrimination challenge.

Neural latent variable models enable the discovery of interesting structure in speech audio data. This paper presents a comparison of two different approaches which are broadly based on predicting future time-steps or auto-encoding the input signal. Our study compares the representations learned by vq-vae and vq-wav2vec in terms of sub-word unit discovery and phoneme recognition performance. Results show that future time-step prediction with vq-wav2vec achieves better performance. The best system achieves an error rate of 13.22 on the ZeroSpeech 2019 ABX phoneme discrimination challenge.
