AS | CL | SD | Feb 6, 2024

Listen, Chat, and Remix: Text-Guided Soundscape Remixing for Enhanced Auditory Experience

arXiv:2402.03710v2 | 8 citations | h-index: 45 | IEEE J Sel Top Signal Process
Originality: Incremental advance
AI Analysis

This work lets users remix soundscapes through text instructions, giving them a simple way to manage desirable and undesirable sounds in environments such as homes or public spaces. It appears incremental because it builds on existing multimodal and audio processing techniques.

The paper tackles the limited control users have over sound mixtures in daily life. It introduces 'Listen, Chat, and Remix' (LCR), a text-guided sound remixer that controls multiple sound sources simultaneously without separating them. The authors report significant improvements in signal quality across extraction, removal, and volume control tasks.

In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume. Our work introduces "Listen, Chat, and Remix" (LCR), a novel multimodal sound remixer that controls each sound source in a mixture based on user-provided text instructions. LCR distinguishes itself with a user-friendly text interface and its unique ability to remix multiple sound sources simultaneously within a mixture, without needing to separate them. Users input open-vocabulary text prompts, which are interpreted by a large language model to create a semantic filter for remixing the sound mixture. The system then decomposes the mixture into its components, applies the semantic filter, and reassembles the filtered components into the desired output. We developed a 160-hour dataset with over 100k mixtures, including speech and various audio sources, along with text prompts for diverse remixing tasks, including extraction, removal, and volume control of single or multiple sources. Our experiments demonstrate significant improvements in signal quality across all remixing tasks and robust performance in zero-shot scenarios with varying numbers and types of sound sources. An audio demo is available at: https://listenchatremix.github.io/demo.
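The three remixing tasks named in the abstract (extraction, removal, and volume control) can be illustrated with a toy sketch. This is not the authors' method: LCR derives a semantic filter from a language model and applies it to learned components, while the sketch below assumes the sources are already known and replaces the filter with a hypothetical table of per-source gains. The names `remix`, `gains`, and `default_gain` are invented for illustration.

```python
from typing import Dict, List


def remix(
    components: Dict[str, List[float]],
    gains: Dict[str, float],
    default_gain: float = 1.0,
) -> List[float]:
    """Scale each named component by its gain and sum them sample by sample.

    `gains` is a hypothetical stand-in for the paper's semantic filter:
    0.0 removes a source, 1.0 keeps it unchanged, other values change its volume.
    Sources not listed in `gains` receive `default_gain`.
    """
    if not components:
        return []
    length = len(next(iter(components.values())))
    if any(len(wave) != length for wave in components.values()):
        raise ValueError("all components must have the same number of samples")
    output = [0.0] * length
    for name, wave in components.items():
        gain = gains.get(name, default_gain)
        for i, sample in enumerate(wave):
            output[i] += gain * sample
    return output


# Toy sources, assumed to be known in advance (the real system estimates
# its components from the mixture itself).
components = {
    "speech": [0.5, -0.5, 0.25, 0.0],
    "dog_bark": [0.25, 0.25, -0.25, 0.5],
    "music": [0.0, 0.5, 0.5, -0.5],
}

# Extraction: keep only the target, silence everything else.
extracted = remix(components, {"speech": 1.0}, default_gain=0.0)
# Removal: silence one source, leave the rest untouched.
removed = remix(components, {"dog_bark": 0.0})
# Volume control: halve one source, leave the rest untouched.
quieter = remix(components, {"music": 0.5})
```

In this simplified view, all three tasks reduce to one operation with different gain settings, which is one way to read the abstract's claim that several sources can be controlled at once.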
