SD · LG · AS · ML · Dec 23, 2017

Variational Autoencoders for Learning Latent Representations of Speech Emotion: A Preliminary Study

arXiv:1712.08708v3 · 104 citations
Originality: Synthesis-oriented
AI Analysis

This paper addresses the need for automated feature learning in speech emotion recognition. The contribution is incremental: it applies an existing method (VAEs) to a new domain (speech).

The authors tackled the problem of speech emotion recognition by proposing Variational Autoencoders (VAEs) to learn latent representations from speech signals, achieving state-of-the-art results on the IEMOCAP dataset.

Learning the latent representation of data in an unsupervised fashion is a very interesting process that provides relevant features for enhancing the performance of a classifier. For speech emotion recognition tasks, generating effective features is crucial. Currently, handcrafted features are mostly used for speech emotion recognition; however, features learned automatically using deep learning have shown strong success in many problems, especially in image processing. In particular, deep generative models such as Variational Autoencoders (VAEs) have gained enormous success for generating features for natural images. Inspired by this, we propose VAEs for deriving the latent representation of speech signals and use this representation to classify emotions. To the best of our knowledge, we are the first to propose VAEs for speech emotion classification. Evaluations on the IEMOCAP dataset demonstrate that features learned by VAEs can produce state-of-the-art results for speech emotion classification.
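
To make the idea of a VAE latent representation concrete, the following is a minimal NumPy sketch of the standard VAE training objective, not the authors' implementation. It assumes a Gaussian encoder that outputs a mean `mu` and log-variance `log_var` per input, a standard normal prior, and a squared-error reconstruction term; the function names, the reconstruction loss, and the equal weighting of the two terms are all illustrative assumptions, since the abstract does not specify the architecture or loss. In a pipeline like the one described, the encoder mean `mu` would typically serve as the feature vector passed to the emotion classifier.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dims."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

def negative_elbo(x, x_recon, mu, log_var):
    """Squared-error reconstruction plus KL term, averaged over the batch.

    Assumption: squared error stands in for the reconstruction likelihood.
    """
    recon = np.sum((x - x_recon) ** 2, axis=-1)
    return float(np.mean(recon + kl_to_standard_normal(mu, log_var)))

# Hypothetical shapes: a batch of 4 inputs with an 8-dimensional latent space.
rng = np.random.default_rng(0)
mu = np.zeros((4, 8))
log_var = np.zeros((4, 8))
z = reparameterize(mu, log_var, rng)  # latent samples, shape (4, 8)
```

The KL term pulls the encoder distribution toward the prior, while the reconstruction term keeps the latent code informative about the input; the balance between the two shapes how useful the learned features are downstream.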

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
