Jun 25, 2022

Self-supervision and Learnable STRFs for Age, Emotion, and Country Prediction

CMU, Meta AI
arXiv:2206.12568v1 · 4 citations · h-index: 58
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of predicting multiple attributes from vocal bursts for applications in affective computing, but it is incremental as it builds on existing methods and datasets.

The paper tackles the simultaneous prediction of age, country, and emotion from vocal bursts using a multitask approach that combines spectro-temporal modulation and self-supervised features, achieving a test score of 0.412 on the ExVo-MultiTask challenge.

This work presents a multitask approach to the simultaneous estimation of age, country of origin, and emotion given vocal burst audio for the 2022 ICML Expressive Vocalizations Challenge ExVo-MultiTask track. The method of choice utilized a combination of spectro-temporal modulation and self-supervised features, followed by an encoder-decoder network organized in a multitask paradigm. We evaluate the complementarity between the tasks posed by examining independent task-specific and joint models, and explore the relative strengths of different feature sets. We also introduce a simple score fusion mechanism to leverage the complementarity of different feature sets for this task. We find that robust data preprocessing in conjunction with score fusion over spectro-temporal receptive field and HuBERT models achieved our best ExVo-MultiTask test score of 0.412.
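The abstract does not spell out the fusion rule, so the following is only a minimal sketch of one common form of score fusion: a weighted average of per-model outputs. The function name `fuse_scores`, the equal default weights, and the example values are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Optional, Sequence


def fuse_scores(
    score_sets: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Fuse per-model score vectors by weighted averaging.

    Assumption: every model emits a score vector of the same length
    (for example, one value per emotion dimension). The weighting
    scheme is hypothetical; the paper may use a different rule.
    """
    if not score_sets:
        raise ValueError("at least one score vector is required")
    length = len(score_sets[0])
    if any(len(scores) != length for scores in score_sets):
        raise ValueError("all score vectors must have the same length")
    if weights is None:
        # Default to equal weighting across models.
        weights = [1.0] * len(score_sets)
    if len(weights) != len(score_sets):
        raise ValueError("one weight per score vector is required")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    # Weighted mean, computed position by position.
    return [
        sum(w * scores[i] for w, scores in zip(weights, score_sets)) / total
        for i in range(length)
    ]


# Made-up scores standing in for an STRF-based and a HuBERT-based model.
strf_scores = [0.25, 0.5]
hubert_scores = [0.75, 1.0]
fused_equal = fuse_scores([strf_scores, hubert_scores])
fused_weighted = fuse_scores([strf_scores, hubert_scores], weights=[1.0, 3.0])
```

Equal weights treat both feature sets as equally reliable; in practice the weights would typically be tuned on a validation split.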

