SD LG AS Jun 25, 2022

Generating Diverse Vocal Bursts with StyleGAN2 and MEL-Spectrograms

arXiv:2206.12563v17.13 citations h-index: 31 Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of generating diverse emotional vocal bursts for audio synthesis applications, representing an incremental improvement over existing methods.

The paper tackles the generative emotional vocal burst task by training a conditional StyleGAN2 on mel-spectrograms. The generated samples substantially improve on the competition baseline for all emotions: even the worst-performing emotion (awe) achieves a Fréchet Audio Distance (FAD, lower is better) of 1.76, compared to 4.81 for the baseline.
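To give a feel for what an FAD score measures, here is a minimal sketch of a Fréchet distance between two sets of embedding vectors. It rests on a loud simplifying assumption: each set is modelled as a Gaussian with a diagonal covariance, so the trace term collapses to a sum of squared standard-deviation gaps. The standard FAD uses full covariance matrices over embeddings from a pretrained audio classifier, so the helper below (`diagonal_frechet_distance`, a hypothetical name) would not reproduce the 1.76 or 4.81 figures.

```python
import math
from statistics import fmean, pvariance


def diagonal_frechet_distance(emb_a, emb_b):
    """Frechet distance between two embedding sets.

    ASSUMPTION: each set is treated as a Gaussian with diagonal
    covariance. The real FAD uses full covariance matrices and
    embeddings from a pretrained audio model.
    """
    dims = len(emb_a[0])
    total = 0.0
    for d in range(dims):
        col_a = [row[d] for row in emb_a]
        col_b = [row[d] for row in emb_b]
        # Squared gap between the per-dimension means.
        mean_gap = fmean(col_a) - fmean(col_b)
        # For diagonal covariances, Tr(S1 + S2 - 2*sqrt(S1*S2))
        # reduces to the squared gap between standard deviations.
        std_gap = math.sqrt(pvariance(col_a)) - math.sqrt(pvariance(col_b))
        total += mean_gap ** 2 + std_gap ** 2
    return total


# Toy 2-D "embeddings" (illustrative values, not data from the paper).
reference = [[0.0, 0.0], [2.0, 2.0]]
generated = [[1.0, 0.0], [3.0, 2.0]]
print(diagonal_frechet_distance(reference, generated))
```

A distance of zero means the two sets have identical per-dimension means and spreads; larger values indicate that the generated distribution drifts further from the reference one.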

We describe our approach for the generative emotional vocal burst task (ExVo Generate) of the ICML Expressive Vocalizations Competition. We train a conditional StyleGAN2 architecture on mel-spectrograms of preprocessed versions of the audio samples. The mel-spectrograms generated by the model are then inverted back to the audio domain. As a result, our generated samples substantially improve upon the baseline provided by the competition from a qualitative and quantitative perspective for all emotions. More precisely, even for our worst-performing emotion (awe), we obtain an FAD of 1.76 compared to the baseline of 4.81 (as a reference, the FAD between the train/validation sets for awe is 0.776).
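The mel-spectrogram representation mentioned above warps the frequency axis so that bands are narrow at low frequencies and wide at high ones. The sketch below illustrates that warping with the widely used HTK-style mel formula. The band count and frequency range are illustrative assumptions, since this summary does not state the paper's spectrogram settings or its inversion method, and the function names are hypothetical.

```python
import math


def hz_to_mel(hz):
    """Convert frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels, f_min, f_max):
    """Center frequencies (Hz) of n_mels bands spaced evenly in mels."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


# ASSUMPTION: 8 bands over 0-8000 Hz, chosen only for illustration.
centers = mel_band_centers(8, 0.0, 8000.0)
for i, hz in enumerate(centers):
    print(f"band {i}: {hz:.1f} Hz")
```

Because the bands are equally spaced in mels, the gap between neighbouring center frequencies in Hz grows with frequency. This is also why inverting a mel-spectrogram back to audio is lossy: fine high-frequency detail is pooled into a few wide bands, and phase has to be estimated.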
