SD CL AS · Nov 1, 2023

In-Context Prompt Editing For Conditional Audio Generation

Ernie Chang, Pin-Jie Lin, Yang Li, Sidd Srinivasan, Gael Le Lan, David Kant, Yangyang Shi, Forrest Iandola, Vikas Chandra

arXiv:2311.00895v1 · 5.84 citations · h-index: 14

Originality Synthesis-oriented

AI Analysis

The work addresses a domain-specific challenge in conditional audio generation for real-world applications, but the approach appears incremental because it builds on existing prompt-editing methods.

The paper tackles the problem of audio quality degradation in text-to-audio generation due to distributional shift from unseen user prompts, and presents a retrieval-based in-context prompt editing framework that enhances audio quality by leveraging training captions as exemplars.

Distributional shift is a central challenge in deploying machine learning models, which can be ill-equipped for real-world data. This is particularly evident in text-to-audio generation, where the encoded representations are easily undermined by unseen prompts, degrading the generated audio. The limited set of text-audio pairs remains inadequate for conditional audio generation in the wild because user prompts are under-specified. In particular, we observe a consistent audio quality degradation in samples generated from user prompts, as opposed to training set prompts. To this end, we present a retrieval-based in-context prompt editing framework that leverages the training captions as demonstrative exemplars to revise the user prompts. We show that the framework enhances audio quality across the set of collected user prompts, which were edited with reference to the training captions as exemplars.
