Apr 21, 2023

A vector quantized masked autoencoder for speech emotion recognition

arXiv:2304.11117v1 | 35 citations | h-index: 16
Originality: Incremental advance
AI Analysis

This work addresses the challenge of data scarcity in speech emotion recognition, which is important for applications like human-computer interaction, but it appears incremental as it builds on existing masked autoencoder and vector quantization techniques.

The paper tackles the problem of limited labeled data in speech emotion recognition by proposing a self-supervised model called VQ-MAE-S, which outperforms state-of-the-art methods after pre-training on VoxCeleb2 and fine-tuning on emotional speech data.

Recent years have seen remarkable progress in speech emotion recognition (SER), thanks to advances in deep learning techniques. However, the limited availability of labeled data remains a significant challenge in the field. Self-supervised learning has recently emerged as a promising solution to address this challenge. In this paper, we propose the vector quantized masked autoencoder for speech (VQ-MAE-S), a self-supervised model that is fine-tuned to recognize emotions from speech signals. The VQ-MAE-S model is based on a masked autoencoder (MAE) that operates in the discrete latent space of a vector-quantized variational autoencoder. Experimental results show that the proposed VQ-MAE-S model, pre-trained on the VoxCeleb2 dataset and fine-tuned on emotional speech data, outperforms an MAE working on the raw spectrogram representation and other state-of-the-art methods in SER.
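The core idea in the abstract, masking in a discrete latent space rather than on the raw spectrogram, can be illustrated with a toy sketch. This is not the authors' implementation: the codebook, frame vectors, mask ratio, and the helper names `quantize` and `mask_tokens` are all hypothetical, and in the actual model the codebook would be learned by a vector-quantized variational autoencoder rather than written by hand. The sketch only shows the two steps the abstract names: mapping continuous frames to codebook indices, then hiding a random subset of those indices so that a model could be trained to predict them.

```python
import random

def quantize(frames, codebook):
    """Map each frame vector to the index of its nearest codebook entry (squared L2)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sqdist(f, codebook[k]))
            for f in frames]

def mask_tokens(tokens, mask_ratio, mask_id, rng):
    """Replace a random subset of tokens with mask_id.

    Returns the masked sequence and the sorted list of masked positions.
    """
    n_mask = int(round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = mask_id
    return masked, positions

# Toy, hand-written codebook and frames (assumptions for illustration only).
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1], [0.2, 0.1]]

tokens = quantize(frames, codebook)      # discrete indices, one per frame
MASK_ID = len(codebook)                  # an extra index reserved for "masked"
masked, positions = mask_tokens(tokens, 0.5, MASK_ID, random.Random(0))

# Prediction targets for a masked-autoencoder objective: the original indices
# at the masked positions.
targets = [tokens[p] for p in positions]
```

Because the targets are discrete indices, the reconstruction task becomes a classification over codebook entries rather than a regression on spectrogram values, which is the distinction the abstract draws between VQ-MAE-S and an MAE on the raw spectrogram.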

Code Implementations: 1 repo
