CL, LG, SD, AS | Apr 16, 2019

Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks

arXiv:1904.07556v2 | 59 citations
Originality: Incremental advance
AI Analysis

This work targets low-resource speech technology and studies of phonetic category learning by discovering symbolic units from unlabelled speech. The advance is incremental, since it builds on existing methods such as VQ-VAEs.

The paper tackles unsupervised acoustic unit discovery for speech synthesis by applying discrete latent-variable neural networks to unlabelled speech, achieving competitive synthesis quality compared to the ZeroSpeech 2019 challenge baseline.

For our submission to the ZeroSpeech 2019 challenge, we apply discrete latent-variable neural networks to unlabelled speech and use the discovered units for speech synthesis. Unsupervised discrete subword modelling could be useful for studies of phonetic category learning in infants or in low-resource speech technology requiring symbolic input. We use an autoencoder (AE) architecture with intermediate discretisation. We decouple acoustic unit discovery from speaker modelling by conditioning the AE's decoder on the training speaker identity. At test time, unit discovery is performed on speech from an unseen speaker, followed by unit decoding conditioned on a known target speaker to obtain reconstructed filterbanks. This output is fed to a neural vocoder to synthesise speech in the target speaker's voice. For discretisation, categorical variational autoencoders (CatVAEs), vector-quantised VAEs (VQ-VAEs) and straight-through estimation are compared at different compression levels on two languages. Our final model uses convolutional encoding, VQ-VAE discretisation, deconvolutional decoding and an FFTNet vocoder. We show that decoupled speaker conditioning intrinsically improves discrete acoustic representations, yielding competitive synthesis quality compared to the challenge baseline.
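The discretisation and speaker-conditioning steps described in the abstract can be illustrated with a minimal NumPy sketch. It shows only the nearest-neighbour codebook lookup used in VQ-VAE-style quantisation and one possible way of attaching a speaker identity to the quantised frames. The function names, the toy codebook, and the one-hot speaker encoding are assumptions made for illustration; the abstract does not specify how the speaker identity is encoded, and the actual model learns its encoder, codebook and decoder with neural networks rather than using fixed arrays.

```python
import numpy as np


def vector_quantise(z, codebook):
    """Map each encoder frame to its nearest codebook vector (squared L2).

    z:        (T, D) array of continuous encoder outputs, one row per frame.
    codebook: (K, D) array of K discrete unit embeddings.
    Returns (indices, z_q): unit index per frame and the quantised frames.
    """
    # Pairwise squared distances between frames and codebook entries: (T, K).
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]


def condition_on_speaker(z_q, speaker_id, num_speakers):
    """Append a one-hot speaker vector to every quantised frame.

    The one-hot encoding is an assumption for this sketch, not a detail
    taken from the paper.
    """
    one_hot = np.zeros(num_speakers, dtype=z_q.dtype)
    one_hot[speaker_id] = 1.0
    tiled = np.tile(one_hot, (z_q.shape[0], 1))
    return np.concatenate([z_q, tiled], axis=1)


# Toy example: 4 units of dimension 4, 3 frames of "encoder output".
codebook = np.eye(4)
z = np.array([
    [0.1, 0.0, 0.9, 0.0],
    [0.0, 0.1, 1.1, 0.0],
    [0.8, 0.1, 0.0, 0.0],
])

indices, z_q = vector_quantise(z, codebook)  # indices: [2, 2, 0]

# Decode in the voice of (hypothetical) target speaker 1 out of 2 speakers.
decoder_input = condition_on_speaker(z_q, speaker_id=1, num_speakers=2)
```

Here `indices` plays the role of the discovered acoustic units, and `decoder_input` is what a speaker-conditioned decoder would receive. Because the speaker information is supplied separately, the discrete units do not need to carry it themselves.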
