SD · LG · AS · Jun 25, 2022

Generating Diverse Vocal Bursts with StyleGAN2 and MEL-Spectrograms

arXiv:2206.12563v1 · 3 citations · h-index: 31
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of generating diverse emotional vocal bursts for audio synthesis applications, representing an incremental improvement over existing methods.

The paper tackles the generative emotional vocal burst task by training a conditional StyleGAN2 on mel-spectrograms. The generated samples substantially improve on the competition baseline for all emotions: even the worst-performing emotion (awe) reaches a Fréchet Audio Distance (FAD, lower is better) of 1.76, against a baseline of 4.81.

We describe our approach for the generative emotional vocal burst task (ExVo Generate) of the ICML Expressive Vocalizations Competition. We train a conditional StyleGAN2 architecture on mel-spectrograms of preprocessed versions of the audio samples. The mel-spectrograms generated by the model are then inverted back to the audio domain. As a result, our generated samples substantially improve upon the baseline provided by the competition from a qualitative and quantitative perspective for all emotions. More precisely, even for our worst-performing emotion (awe), we obtain an FAD of 1.76 compared to the baseline of 4.81 (as a reference, the FAD between the train/validation sets for awe is 0.776).
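The FAD figures quoted above compare the distribution of generated audio with that of reference audio. As a rough illustration of the underlying computation, the sketch below evaluates the Fréchet distance between Gaussians fitted to two sets of embedding vectors. It is a minimal sketch, not the competition's evaluation code: the function name `frechet_distance` is hypothetical, and the embeddings here are random placeholder arrays, whereas an actual FAD evaluation would use embeddings from a pretrained audio model.

```python
import numpy as np


def frechet_distance(emb_a, emb_b):
    """Frechet distance between Gaussians fitted to two embedding sets.

    emb_a, emb_b: arrays of shape (n_samples, dim). In a real FAD
    evaluation these would be embeddings from a pretrained audio model;
    here they are assumed to be given.
    """
    emb_a = np.asarray(emb_a, dtype=np.float64)
    emb_b = np.asarray(emb_b, dtype=np.float64)

    # Fit a Gaussian (mean and covariance) to each set of embeddings.
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(emb_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(emb_b, rowvar=False))

    # Tr(sqrt(cov_a cov_b)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b. These are real and non-negative in
    # exact arithmetic; clip tiny negative values caused by round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(500, 8))           # placeholder "real" embeddings
    generated = rng.normal(loc=0.3, size=(500, 8))  # placeholder "generated" embeddings
    print("Frechet distance:", frechet_distance(reference, generated))
```

Under this definition, two identical embedding sets give a distance of zero, which is why the train/validation value quoted for awe (0.776) serves as a practical floor for interpreting the 1.76 and 4.81 figures.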

Code Implementations: 1 repo