Generating Diverse Vocal Bursts with StyleGAN2 and Mel-Spectrograms
This work addresses the generation of diverse emotional vocal bursts for audio synthesis applications, improving on the competition baseline.
The paper tackles the generative emotional vocal burst task by training a conditional StyleGAN2 on mel-spectrograms. The generated samples substantially improve over the baseline for all emotions: the worst-performing emotion (awe) reaches an FAD of 1.76, compared with 4.81 for the baseline.
We describe our approach to the generative emotional vocal burst task (ExVo Generate) of the ICML Expressive Vocalizations Competition. We train a conditional StyleGAN2 architecture on mel-spectrograms of preprocessed versions of the audio samples. The mel-spectrograms generated by the model are then inverted back to the audio domain. Our generated samples substantially improve upon the competition baseline, both qualitatively and quantitatively, for all emotions. More precisely, even for our worst-performing emotion (awe), we obtain a Fréchet Audio Distance (FAD) of 1.76 compared to 4.81 for the baseline (for reference, the FAD between the train and validation sets for awe is 0.776).
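To make the reported FAD values easier to interpret, the following is a minimal sketch of the Fréchet distance between two sets of embedding vectors, each summarized by a Gaussian (mean and covariance). It is not the competition's evaluation code: the function name `frechet_distance` and the random placeholder embeddings are assumptions for illustration, and an actual FAD computation would first extract embeddings from real and generated audio with a pretrained audio model.

```python
import numpy as np


def frechet_distance(emb_a, emb_b):
    """Frechet distance between Gaussians fitted to two embedding sets.

    emb_a, emb_b: arrays of shape (n_samples, dim), with dim >= 2.
    Hypothetical helper for illustration only.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)

    # Tr(sqrt(cov_a @ cov_b)) is the sum of the square roots of the
    # eigenvalues of the product. For positive semi-definite inputs these
    # eigenvalues are real and non-negative up to numerical noise, so we
    # drop tiny imaginary parts and clip small negatives.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Placeholder embeddings (assumption): random vectors stand in for
# audio-model embeddings of reference and generated samples.
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 4))
generated = reference + 3.0  # same covariance, shifted mean

same_set_distance = frechet_distance(reference, reference)
shifted_distance = frechet_distance(reference, generated)
```

A lower value means the two embedding distributions are closer, which is why the train/validation FAD for awe serves as a practical reference point for how small the score can get on this data.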