Representation Learning with Graph Neural Networks for Speech Emotion Recognition
This work addresses noise robustness in speech emotion recognition and offers a more parameter-efficient model for applications such as human-computer interaction, although the contribution is incremental because it adapts GNNs to a specific domain.
The paper tackles noise interference in speech emotion recognition by proposing a cosine similarity-based graph convolutional network, which outperforms state-of-the-art methods or achieves competitive results while using about 1/30 of the model parameters.
Learning expressive representations is crucial in deep learning. In speech emotion recognition (SER), vacuum regions or noise in the speech interfere with expressive representation learning, and traditional RNN-based models are susceptible to such noise. Recently, Graph Neural Networks (GNNs) have demonstrated their effectiveness for representation learning, and we adopt this framework for SER. In particular, we propose a cosine similarity-based graph as an ideal graph structure for representation learning in SER. We present a Cosine similarity-based Graph Convolutional Network (CoGCN) that is robust to perturbation and noise. Experimental results show that our method outperforms state-of-the-art methods or provides competitive results while reducing the model size to only 1/30 of the parameters.
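To make the idea of a cosine similarity-based graph concrete, the following is a minimal sketch and not the paper's implementation. It assumes that each node is a frame-level feature vector, that two frames are linked when their cosine similarity exceeds a threshold, and that propagation uses the common symmetric normalization with self-loops. The function names, the `threshold` value, and the mean pooling at the end are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import numpy as np

def cosine_similarity_graph(frames, threshold=0.5):
    """Link frames whose cosine similarity exceeds `threshold` (assumed rule)."""
    norms = np.linalg.norm(frames, axis=1, keepdims=True)
    unit = frames / np.maximum(norms, 1e-12)  # guard against all-zero frames
    similarity = unit @ unit.T                # pairwise cosine similarity
    adjacency = (similarity > threshold).astype(float)
    np.fill_diagonal(adjacency, 0.0)          # self-loops are added later
    return adjacency

def normalize_adjacency(adjacency):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}."""
    a_hat = adjacency + np.eye(adjacency.shape[0])
    degree = a_hat.sum(axis=1)                # >= 1 because of the self-loop
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_norm, features, weight):
    """One graph convolution followed by ReLU."""
    return np.maximum(a_norm @ features @ weight, 0.0)

# Toy input: 6 frames with 4 features each (random stand-in for acoustic features).
rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 4))
weight = rng.normal(size=(4, 3))              # hypothetical layer weights

adjacency = cosine_similarity_graph(frames, threshold=0.2)
hidden = gcn_layer(normalize_adjacency(adjacency), frames, weight)
utterance_embedding = hidden.mean(axis=0)     # assumed mean pooling over frames
```

Under this assumed rule, a frame whose features resemble none of the others ends up with no neighbours and so contributes only to its own node update, which is one plausible reading of how such a graph could limit the influence of noisy or empty regions.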