SD CL LG AS | Aug 23, 2023

Audio Generation with Multiple Conditional Diffusion Model

arXiv:2308.11940v4 | 38 citations | h-index: 17
Originality: Incremental advance
AI Analysis

This work addresses the limited control that users have over text-based audio generation. It is an incremental advance that augments pre-trained text-to-audio models with supplementary conditions.

The paper tackles the limited controllability of text-based audio generation by proposing a model that incorporates additional conditions like timestamp, pitch contour, and energy contour to achieve fine-grained control over temporal order, pitch, and energy in generated audio.

Text-based audio generation models have limitations as they cannot encompass all the information in audio, leading to restricted controllability when relying solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions including content (timestamp) and style (pitch contour and energy contour) as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio. To preserve the diversity of generation, we employ a trainable control condition encoder that is enhanced by a large language model and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing datasets into a new dataset comprising the audio and corresponding conditions and use a series of evaluation metrics to evaluate the controllability performance. Experimental results demonstrate that our model successfully achieves fine-grained control to accomplish controllable audio generation. Audio samples and our dataset are publicly available at https://conditionaudiogen.github.io/conditionaudiogen/
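
To make the two kinds of supplementary condition more concrete, here is a minimal sketch of how a style condition (an energy contour) and a content condition (timestamps) could be turned into frame-level arrays. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say how the contours are extracted, so frame-wise RMS is assumed for energy, and the function names, the frame sizes, and the `(label, onset, offset)` event format are all hypothetical.

```python
import math


def energy_contour(samples, frame_len=4, hop=4):
    """Frame-wise RMS energy of a mono waveform (list of floats).

    Assumption: RMS per frame stands in for the "energy contour";
    the source does not specify the extraction method.
    """
    contour = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        contour.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return contour


def timestamp_to_frames(events, num_frames, frames_per_second):
    """Convert (label, onset_s, offset_s) events into 0/1 activity per frame.

    Assumption: a hypothetical event format; each label gets one row
    marking the frames in which that sound event is active.
    """
    activity = {}
    for label, onset, offset in events:
        row = activity.setdefault(label, [0] * num_frames)
        first = int(onset * frames_per_second)
        last = min(num_frames, int(math.ceil(offset * frames_per_second)))
        for i in range(first, last):
            row[i] = 1
    return activity


# Toy waveform: silence, then a quiet segment, then a louder segment.
wave = [0, 0, 0, 0, 1, 1, 1, 1, -2, -2, -2, -2]
energy = energy_contour(wave)

# One "dog" event from 0.5 s to 1.5 s on a 2-frames-per-second grid.
frames = timestamp_to_frames([("dog", 0.5, 1.5)], num_frames=4, frames_per_second=2)
```

In a setup like the one the abstract describes, arrays of this kind would be passed to the trainable control condition encoder and Fusion-Net, while the weights of the pre-trained text-to-audio model stay frozen.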

