SD · LG · AS · Jul 23, 2023

Self-Supervised Learning for Audio-Based Emotion Recognition

arXiv:2307.12343v1 · 5 citations · h-index: 30
Originality: Incremental advance
AI Analysis

This work addresses the scarcity of training labels in affective computing for applications such as mental healthcare and gaming, but it is incremental as it adapts existing self-supervised methods to encoded rather than raw data.

The paper tackles the problem of limited labeled data for audio-based emotion recognition by applying self-supervised pre-training to encoded acoustic data. Pre-training yields consistent improvements across all metrics, with the largest gains when training examples are few and for emotions such as happy, sad, and anger.

Emotion recognition models using audio input data can enable the development of interactive systems with applications in mental healthcare, marketing, gaming, and social media analysis. While the field of affective computing using audio data is rich, a major barrier to achieving consistently high-performance models is the paucity of available training labels. Self-supervised learning (SSL) is a family of methods that can learn despite a scarcity of supervised labels by predicting properties of the data itself. To understand the utility of self-supervised learning for audio-based emotion recognition, we have applied self-supervised pre-training to the classification of emotions from CMU-MOSEI's acoustic modality. Unlike prior papers that have experimented with raw acoustic data, our technique has been applied to encoded acoustic data. Our model is first pre-trained to uncover the randomly masked timestamps of the acoustic data. The pre-trained model is then fine-tuned using a small sample of annotated data. The performance of the final model is then evaluated on several metrics against a baseline deep learning model with an identical backbone architecture. We find that self-supervised learning consistently improves the performance of the model across all metrics. This work shows the utility of self-supervised learning for affective computing, demonstrating that self-supervised learning is most useful when the number of training examples is small, and that the effect is most pronounced for emotions that are easier to classify, such as happy, sad, and anger. This work further demonstrates that self-supervised learning works when applied to embedded feature representations rather than the traditional approach of pre-training on the raw input space.
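The masked pre-training objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 20% mask ratio, the zero-fill masking strategy, the mean-squared-error loss, and the 50 x 74 feature shape are all assumptions chosen for illustration. The sketch hides a random subset of timesteps in a sequence of encoded acoustic features and scores a reconstruction only on the hidden positions.

```python
import numpy as np

def mask_timesteps(features, mask_ratio=0.15, rng=None):
    """Zero out a random subset of timesteps (assumed masking strategy).

    features: array of shape (num_steps, feature_dim).
    Returns the masked copy and a boolean mask marking hidden timesteps.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_steps = features.shape[0]
    num_masked = max(1, int(round(mask_ratio * num_steps)))
    masked_idx = rng.choice(num_steps, size=num_masked, replace=False)
    mask = np.zeros(num_steps, dtype=bool)
    mask[masked_idx] = True
    masked = features.copy()
    masked[mask] = 0.0
    return masked, mask

def masked_reconstruction_loss(predicted, target, mask):
    """Mean squared error computed over the masked timesteps only."""
    diff = predicted[mask] - target[mask]
    return float(np.mean(diff ** 2))

# Hypothetical encoded acoustic sequence: 50 timesteps, 74 features each.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 74))
masked, mask = mask_timesteps(features, mask_ratio=0.2, rng=rng)

# Stand-in "models": one echoes the masked input, one reconstructs perfectly.
loss_identity = masked_reconstruction_loss(masked, features, mask)
loss_perfect = masked_reconstruction_loss(features, features, mask)
```

In a full pipeline, a backbone network would take `masked` as input and be trained to minimize this loss. The same backbone would then be fine-tuned with a classification head on the small labeled subset.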

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
