In-Context Prompt Editing For Conditional Audio Generation
This work addresses a domain-specific challenge in conditional audio generation for real-world applications, but the approach appears incremental because it builds on existing prompt-editing methods.
The paper tackles the problem of audio quality degradation in text-to-audio generation due to distributional shift from unseen user prompts, and presents a retrieval-based in-context prompt editing framework that enhances audio quality by leveraging training captions as exemplars.
Distributional shift is a central challenge in deploying machine learning models, which can be ill-equipped for real-world data. This is particularly evident in text-to-audio generation, where the encoded representations are easily undermined by unseen prompts, degrading the generated audio. The limited set of text-audio pairs remains inadequate for conditional audio generation in the wild because user prompts are under-specified. In particular, we observe a consistent degradation in the quality of audio generated from user prompts compared with audio generated from training-set prompts. To this end, we present a retrieval-based in-context prompt editing framework that uses the training captions as demonstrative exemplars to revise the user prompts. We show that the framework enhances audio quality across the set of collected user prompts, which were edited with reference to the training captions as exemplars.
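The retrieve-then-edit idea can be illustrated with a minimal sketch. It is not the paper's implementation: the bag-of-words cosine retriever, the instruction template, the example captions, and the function names `retrieve_exemplars` and `build_edit_instruction` are all assumptions made for illustration. The resulting instruction would be passed to a language model of one's choice, which is left out here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_exemplars(user_prompt, training_captions, k=2):
    # Assumption: a simple lexical retriever stands in for whatever
    # retriever the framework actually uses.
    query = Counter(tokenize(user_prompt))
    scored = [(cosine(query, Counter(tokenize(c))), c) for c in training_captions]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [caption for _, caption in scored[:k]]


def build_edit_instruction(user_prompt, exemplars):
    # Hypothetical in-context template: retrieved captions act as
    # demonstrations of the style the edited prompt should follow.
    lines = ["Rewrite the prompt in the style of these audio captions:"]
    lines += [f"- {caption}" for caption in exemplars]
    lines.append(f"Prompt: {user_prompt}")
    lines.append("Edited prompt:")
    return "\n".join(lines)


# Made-up captions standing in for a training set.
training_captions = [
    "A dog barks repeatedly while birds chirp in the background",
    "Rain falls steadily on a metal roof as thunder rumbles",
    "A man speaks over the hum of a running engine",
]
user_prompt = "dog barking"
exemplars = retrieve_exemplars(user_prompt, training_captions, k=1)
instruction = build_edit_instruction(user_prompt, exemplars)
print(instruction)
```

In this toy setup the under-specified prompt "dog barking" retrieves the dog caption because the two share the token "dog"; a dense text encoder would be a natural substitute for the lexical scorer.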