Aalto's End-to-End DNN systems for the INTERSPEECH 2020 Computational Paralinguistics Challenge
This work addresses two weaknesses of end-to-end models in computational paralinguistics: the unstable performance of single models and the under-utilization of task-specific information by shared architectures. The contribution is incremental, since it builds on existing end-to-end methods with task-specific modifications.
The paper improves performance on the three INTERSPEECH 2020 Computational Paralinguistics Challenge tasks by combining ensembles of end-to-end neural network models with task-specific modifications. The ensembles outperform single models and give results that are competitive with, or better than, the baselines.
End-to-end (E2E) neural network models have shown significant performance benefits on different INTERSPEECH ComParE tasks. Prior work has applied either a single instance of an E2E model to a task or the same E2E architecture to different tasks. However, applying a single model is unstable, and using the same architecture across tasks under-utilizes task-specific information. On the ComParE 2020 tasks, we investigate applying an ensemble of E2E models for robust performance and developing task-specific modifications for each task. ComParE 2020 introduces three sub-challenges: the breathing sub-challenge, to predict the output of a respiratory belt worn by a patient while speaking; the elderly sub-challenge, to estimate the elderly speaker's arousal and valence levels; and the mask sub-challenge, to classify whether the speaker is wearing a mask. On each of these tasks, an ensemble outperforms the single E2E model. On the breathing sub-challenge, we study the impact of multi-loss strategies on task performance. On the elderly sub-challenge, predicting the valence and arousal levels prompts us to investigate multi-task training and to implement data sampling strategies to handle class imbalance. On the mask sub-challenge, an E2E system without feature engineering is competitive with feature-engineered baselines and provides substantial gains when combined with them.
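As a rough illustration of two ideas named in the abstract, the sketch below shows one common way to combine an ensemble and one common way to counter class imbalance. The abstract does not state how the ensemble outputs are fused or which sampling strategy is used, so posterior averaging and inverse-frequency sample weights are assumptions here, and all function names are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def average_posteriors(model_outputs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-class probabilities from several models for one utterance.

    Assumption: the ensemble is fused by simple averaging; the paper may
    combine its models differently.
    """
    n_models = len(model_outputs)
    n_classes = len(model_outputs[0])
    return [
        sum(output[c] for output in model_outputs) / n_models
        for c in range(n_classes)
    ]


def ensemble_predict(model_outputs: Sequence[Sequence[float]]) -> int:
    """Return the index of the class with the highest averaged probability."""
    averaged = average_posteriors(model_outputs)
    return max(range(len(averaged)), key=averaged.__getitem__)


def inverse_frequency_weights(labels: Sequence[Hashable]) -> List[float]:
    """Give each sample a weight of 1 / (count of its class).

    Assumption: one possible sampling scheme for imbalanced labels; drawing
    samples in proportion to these weights makes every class equally likely.
    """
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


if __name__ == "__main__":
    # Three hypothetical models scoring one utterance as (no mask, mask).
    outputs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
    print(ensemble_predict(outputs))

    # Hypothetical imbalanced arousal labels.
    print(inverse_frequency_weights(["low", "low", "high"]))
```

In this toy case one model prefers the first class, but the averaged posterior still favours the second, which is the kind of smoothing of single-model instability that motivates an ensemble.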