AS · LG · SD · Mar 25, 2022

EmotionNAS: Two-stream Neural Architecture Search for Speech Emotion Recognition

arXiv:2203.13617v2 · 10 citations · h-index: 99
AI Analysis

This work addresses the time-consuming and labor-intensive process of designing optimal models for different SER datasets, offering an automated solution for researchers and practitioners in human-computer interaction.

The paper tackles the problem of manually designing models for speech emotion recognition (SER) by proposing EmotionNAS, a two-stream neural architecture search framework that takes handcrafted and deep features as inputs. The authors report that it outperforms existing manually designed and NAS-based models, setting a new state of the art.

Speech emotion recognition (SER) is an important research topic in human-computer interaction. Existing works mainly rely on human expertise to design models. Despite their success, different datasets often require distinct structures and hyperparameters. Searching for an optimal model for each dataset is time-consuming and labor-intensive. To address this problem, we propose a two-stream neural architecture search (NAS) based framework, called "EmotionNAS". Specifically, we take two-stream features (i.e., handcrafted and deep features) as the inputs, followed by NAS to search for the optimal structure for each stream. Furthermore, we incorporate complementary information in different streams through an efficient information supplement module. Experimental results demonstrate that our method outperforms existing manually-designed and NAS-based models, setting the new state-of-the-art record.
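To make the two-stream idea concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the abstract does not state the search strategy, so the sketch assumes a differentiable, DARTS-style relaxation in which each stream mixes a set of candidate operations with softmax weights. The candidate operations (`identity`, `scale_half`, `smooth`), the `gate` parameter, and the additive stand-in for the information supplement module are all hypothetical and chosen only for illustration. Both streams are assumed to have equal length so they can be combined element-wise.

```python
import math

# Hypothetical candidate operations on a 1-D feature vector (illustrative only).
def identity(x):
    return list(x)

def scale_half(x):
    return [0.5 * v for v in x]

def smooth(x):
    # Moving average over each element and its immediate neighbours.
    out = []
    for i in range(len(x)):
        window = x[max(0, i - 1): i + 2]
        out.append(sum(window) / len(window))
    return out

CANDIDATE_OPS = {"identity": identity, "scale_half": scale_half, "smooth": smooth}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_op(x, alphas):
    """Softmax-weighted sum of all candidate ops (assumed continuous relaxation)."""
    weights = softmax(alphas)
    outputs = [op(x) for op in CANDIDATE_OPS.values()]
    return [sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(x))]

def two_stream_forward(handcrafted, deep, alphas_h, alphas_d, gate=0.5):
    """Run each stream through its own mixed op, then combine the streams."""
    h = mixed_op(handcrafted, alphas_h)
    d = mixed_op(deep, alphas_d)
    # Assumed stand-in for the information supplement module:
    # the deep stream receives a gated copy of the handcrafted stream.
    d_supplemented = [dv + gate * hv for dv, hv in zip(d, h)]
    return h + d_supplemented  # concatenated two-stream representation

def derive_architecture(alphas):
    """Pick the candidate op with the largest architecture parameter."""
    names = list(CANDIDATE_OPS)
    best = max(range(len(alphas)), key=lambda i: alphas[i])
    return names[best]
```

In a real search, the architecture parameters (`alphas_h`, `alphas_d`) would be learned jointly with the network weights, and `derive_architecture` would then be applied to each stream separately, which is what allows each stream to end up with a different structure.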

