AS · AI · SD · Jun 8, 2023

VIFS: An End-to-End Variational Inference for Foley Sound Synthesis

arXiv:2306.05004v1 · 6 citations · h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses Foley sound synthesis for audio production and multimedia applications. The advance is incremental because it builds on existing speech synthesis techniques.

The paper tackles the problem of generating diverse Foley sounds from a single category index by adapting VITS, a text-to-speech model with variational inference. It reports high-quality synthesis after modifying the prior encoder to improve consistency within each clip and variance within each category.

The goal of DCASE 2023 Challenge Task 7 is to generate various sound clips for Foley sound synthesis (FSS) using a "category-to-sound" approach. A "category" is expressed by a single index, while the corresponding "sound" covers diverse and different sound examples. To generate diverse sounds for a given category, we adopt VITS, a text-to-speech (TTS) model with variational inference. In addition, we apply various techniques from speech synthesis, including PhaseAug and Avocodo. Unlike TTS models, which generate short pronunciations from phonemes and speaker identity, the category-to-sound problem requires generating diverse sounds from a category index alone. To compensate for this difference while maintaining consistency within each audio clip, we heavily modified the prior encoder to enhance consistency with the posterior latent variables. This introduces an additional Gaussian on the prior encoder, which promotes variance within the category. With these modifications, we propose VIFS, variational inference for end-to-end Foley sound synthesis, which generates diverse, high-quality sounds.
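
To make the "category-to-sound" idea more concrete, the following is a minimal sketch of sampling latent variables from a category-conditioned Gaussian prior. It is not the VIFS prior encoder: the lookup table, the sizes, and the names `make_prior_table` and `sample_latents` are all hypothetical, and the parameters are random rather than learned. It only illustrates how one category index can yield many different latent sequences, which a decoder would then turn into different sounds.

```python
import numpy as np

# All sizes below are illustrative assumptions, not values from the paper.
NUM_CATEGORIES = 7   # assumed number of sound categories
LATENT_DIM = 4       # hypothetical latent dimensionality
NUM_FRAMES = 5       # hypothetical number of latent frames per clip


def make_prior_table(num_categories, latent_dim, seed=0):
    """Build a stand-in for learned per-category Gaussian parameters."""
    rng = np.random.default_rng(seed)
    mean = rng.normal(size=(num_categories, latent_dim))
    log_std = rng.normal(scale=0.1, size=(num_categories, latent_dim))
    return mean, log_std


def sample_latents(category, mean, log_std, num_frames, rng):
    """Draw z = mu + sigma * eps for each frame of one clip."""
    mu = mean[category]                  # shape: (latent_dim,)
    sigma = np.exp(log_std[category])    # exponentiate so sigma is positive
    eps = rng.standard_normal((num_frames, mu.shape[0]))
    return mu + sigma * eps              # shape: (num_frames, latent_dim)


mean, log_std = make_prior_table(NUM_CATEGORIES, LATENT_DIM)

# Same category index, different noise: two distinct latent sequences.
z_a = sample_latents(3, mean, log_std, NUM_FRAMES, np.random.default_rng(1))
z_b = sample_latents(3, mean, log_std, NUM_FRAMES, np.random.default_rng(2))
```

In this sketch the variance within a category comes entirely from the sampled noise `eps`; the abstract describes a more elaborate prior encoder with an additional Gaussian, which is not reproduced here.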

Code Implementations: 1 repo