CL LG SD AS

Nov 1, 2018

Truly unsupervised acoustic word embeddings using weak top-down constraints in encoder-decoder models

arXiv:1811.00403v2

71 citations

Originality: Incremental advance

AI Analysis

This work targets zero-resource speech systems for search and discovery, where unlabelled speech is the only available resource. It is an incremental advance over prior unsupervised embedding methods, most of which still rely on some supervision.

The paper tackles learning acoustic word embeddings without any supervision, including word or phoneme boundaries. It proposes the EncDec-CAE model, which is trained on word pairs found automatically by an unsupervised term discovery system. In a word discrimination task on two languages, it improves average precision by 24% relative over its closest competitor.
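To make the reported metric concrete, here is a minimal sketch of how average precision could be computed for a word discrimination task. The summary does not specify the protocol, so several details are assumptions: every pair of segments is scored by cosine distance, a pair counts as correct when both segments carry the same word label, and the function names (`same_different_ap`, `relative_improvement`) are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def same_different_ap(embeddings, labels):
    """Average precision over all segment pairs (assumed protocol).

    Pairs are ranked from closest to farthest; a pair is a "hit"
    when both segments have the same word label.
    """
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        dist = cosine_distance(embeddings[i], embeddings[j])
        pairs.append((dist, labels[i] == labels[j]))
    pairs.sort(key=lambda p: p[0])  # closest pairs first

    n_same = sum(1 for _, same in pairs if same)
    if n_same == 0:
        raise ValueError("need at least one same-word pair")

    hits = 0
    precision_sum = 0.0
    for rank, (_, same) in enumerate(pairs, start=1):
        if same:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / n_same


def relative_improvement(ap_new, ap_baseline):
    """Relative gain of ap_new over ap_baseline, as a fraction."""
    return (ap_new - ap_baseline) / ap_baseline
```

A "24% relative" improvement in this sense means `relative_improvement(ap_new, ap_baseline)` is about 0.24, which is different from a gain of 24 absolute percentage points.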

We investigate unsupervised models that can map a variable-duration speech segment to a fixed-dimensional representation. In settings where unlabelled speech is the only available resource, such acoustic word embeddings can form the basis for "zero-resource" speech search, discovery and indexing systems. Most existing unsupervised embedding methods still use some supervision, such as word or phoneme boundaries. Here we propose the encoder-decoder correspondence autoencoder (EncDec-CAE), which, instead of true word segments, uses automatically discovered segments: an unsupervised term discovery system finds pairs of words of the same unknown type, and the EncDec-CAE is trained to reconstruct one word given the other as input. We compare it to a standard encoder-decoder autoencoder (AE), a variational AE with a prior over its latent embedding, and downsampling. EncDec-CAE outperforms its closest competitor by 24% relative in average precision on two languages in a word discrimination task.
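Two ideas from the abstract can be illustrated with a short sketch: the downsampling baseline, which turns a variable-duration segment into a fixed-dimensional vector, and the difference between autoencoder and correspondence training targets. This is not the authors' code. The number of sampled frames, the use of both pair directions, and the function names are assumptions made for illustration; a segment is represented here as a list of per-frame feature lists.

```python
def downsample_embedding(frames, n_samples=10):
    """Fixed-dimensional embedding by keeping n_samples equally spaced frames.

    frames: list of per-frame feature vectors (lists of floats).
    Returns a flat list of length n_samples * feature_dim,
    regardless of how many frames the segment has.
    """
    if not frames:
        raise ValueError("segment has no frames")
    n_frames = len(frames)
    embedding = []
    for k in range(n_samples):
        if n_samples == 1:
            idx = (n_frames - 1) // 2  # single sample: take the middle frame
        else:
            # Equally spaced positions from the first to the last frame
            idx = round(k * (n_frames - 1) / (n_samples - 1))
        embedding.extend(frames[idx])
    return embedding


def make_training_examples(discovered_pairs, correspondence=True):
    """Build (input, target) examples from discovered word pairs.

    correspondence=True: reconstruct one word of a pair from the other
    (the correspondence idea). Using both directions is an assumption.
    correspondence=False: plain autoencoder, each segment is its own target.
    """
    examples = []
    for seg_a, seg_b in discovered_pairs:
        if correspondence:
            examples.append((seg_a, seg_b))
            examples.append((seg_b, seg_a))
        else:
            examples.append((seg_a, seg_a))
            examples.append((seg_b, seg_b))
    return examples
```

With this sketch, a 7-frame segment and a 25-frame segment with the same feature dimension both map to vectors of the same length, which is what allows segments of different durations to be compared directly.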
