CL, HC, MM, SD, AS
Dec 23, 2023

emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation

Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, Xie Chen

arXiv:2312.15185v125.0356 citations | Has Code | ACL

Originality: Incremental advance

AI Analysis

This work addresses the need for a universal emotion representation model in speech processing, filling a gap in the field, though it appears incremental as it builds on existing self-supervised techniques.

The authors propose emotion2vec, a self-supervised pre-trained model for speech emotion representation. With only linear layers trained on top of it, the model outperforms state-of-the-art pre-trained universal models and emotion specialist models on the IEMOCAP dataset, and it shows consistent improvements across speech emotion recognition datasets in 10 languages and on other emotion tasks.

We propose emotion2vec, a universal speech emotion representation model. emotion2vec is pre-trained on open-source unlabeled emotion data through self-supervised online distillation, combining utterance-level loss and frame-level loss during pre-training. emotion2vec outperforms state-of-the-art pre-trained universal models and emotion specialist models by only training linear layers for the speech emotion recognition task on the mainstream IEMOCAP dataset. In addition, emotion2vec shows consistent improvements among 10 different languages of speech emotion recognition datasets. emotion2vec also shows excellent results on other emotion tasks, such as song emotion recognition, emotion prediction in conversation, and sentiment analysis. Comparison experiments, ablation experiments, and visualization comprehensively demonstrate the universal capability of the proposed emotion2vec. To the best of our knowledge, emotion2vec is the first universal representation model in various emotion-related tasks, filling a gap in the field.
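The abstract states only that pre-training uses self-supervised online distillation with an utterance-level loss and a frame-level loss. The following is a minimal sketch of how such a combined objective might look, not the authors' implementation. Everything concrete here is an assumption made for illustration: mean squared error as the distance, mean pooling for the utterance-level view, an exponential-moving-average teacher, the weight `alpha`, and all function names.

```python
from typing import List, Sequence

Vector = List[float]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mean_pool(frames: Sequence[Sequence[float]]) -> Vector:
    """Average frame vectors over time (assumed utterance-level pooling)."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def frame_level_loss(student, teacher, masked: Sequence[int]) -> float:
    """Average per-frame error on the masked positions only (assumption)."""
    return sum(mse(student[t], teacher[t]) for t in masked) / len(masked)


def utterance_level_loss(student, teacher) -> float:
    """Error between pooled student and pooled teacher representations."""
    return mse(mean_pool(student), mean_pool(teacher))


def combined_loss(student, teacher, masked, alpha: float = 1.0) -> float:
    """Frame-level loss plus alpha times utterance-level loss.

    alpha is a hypothetical weighting term, not a value from the paper.
    """
    return (frame_level_loss(student, teacher, masked)
            + alpha * utterance_level_loss(student, teacher))


def ema_update(teacher_w: Sequence[float], student_w: Sequence[float],
               decay: float = 0.999) -> Vector:
    """Online distillation step: teacher weights track the student by EMA.

    The EMA form and the decay value are assumptions for this sketch.
    """
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_w, student_w)]
```

In this sketch the student sees a masked input and the teacher sees the full input. After each student optimization step, `ema_update` would refresh the teacher, which is what makes the distillation "online" rather than relying on a fixed, separately trained teacher.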

