What Do I Hear? Generating Sounds for Visuals with ChatGPT
This paper addresses the challenge media creators face in crafting convincing soundscapes, though the contribution appears incremental: it builds on existing sound generation methods by adding a language model.
The paper tackles the problem of generating realistic soundscapes for visual media by leveraging ChatGPT's reasoning capabilities to suggest sounds beyond visible elements, resulting in a workflow for creating immersive auditory environments.
This short paper introduces a workflow for generating realistic soundscapes for visual media. In contrast to prior work, which primarily focuses on matching sounds to on-screen visuals, our approach extends to suggesting sounds that may not be immediately visible but are essential to crafting a convincing and immersive auditory environment. Our key insight is to leverage the reasoning capabilities of language models, such as ChatGPT. In this paper, we describe our workflow, which consists of three steps: creating a scene context, brainstorming sounds, and generating the sounds.
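As an illustration only, the three steps of the workflow (scene context, brainstorming, generation) could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation described in the paper: the names `SceneContext`, `build_prompt`, `parse_sound_list`, `brainstorm_sounds`, and `generate_soundscape` are hypothetical, as is the prompt wording. The language model and the audio generator are passed in as plain callables so that no particular API is assumed.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SceneContext:
    """Hypothetical container for what is known about the visual (step 1)."""
    description: str
    location: str
    time_of_day: str


def build_prompt(ctx: SceneContext, max_sounds: int) -> str:
    """Turn the scene context into a brainstorming prompt (assumed wording)."""
    return (
        f"Scene: {ctx.description}\n"
        f"Location: {ctx.location}\n"
        f"Time of day: {ctx.time_of_day}\n"
        f"List up to {max_sounds} sounds a viewer would expect to hear, "
        "including sources that are off-screen. One sound per line."
    )


# Matches a leading bullet ("-", "*") or a list number ("1.", "2)").
_BULLET = re.compile(r"^\s*(?:[-*]|\d+[.)])\s*")


def parse_sound_list(reply: str) -> List[str]:
    """Extract one sound name per non-empty line of the model's reply."""
    sounds = []
    for line in reply.splitlines():
        name = _BULLET.sub("", line).strip()
        if name:
            sounds.append(name)
    return sounds


def brainstorm_sounds(
    ctx: SceneContext,
    ask_llm: Callable[[str], str],
    max_sounds: int = 5,
) -> List[str]:
    """Step 2: ask the language model which sounds fit the scene."""
    reply = ask_llm(build_prompt(ctx, max_sounds))
    return parse_sound_list(reply)[:max_sounds]


def generate_soundscape(
    ctx: SceneContext,
    ask_llm: Callable[[str], str],
    synthesize: Callable[[str], bytes],
    max_sounds: int = 5,
) -> Dict[str, bytes]:
    """Step 3: produce one audio clip per brainstormed sound."""
    names = brainstorm_sounds(ctx, ask_llm, max_sounds)
    return {name: synthesize(name) for name in names}
```

In use, `ask_llm` would wrap a call to a chat model and `synthesize` would wrap a text-to-audio generator or a sound-library lookup. Keeping both as injected callables lets the brainstorming step be tested with a canned reply before any external service is connected.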