Separate What You Describe: Language-Queried Audio Source Separation
This work addresses the challenge of handling complex natural language descriptions in audio source separation, enabling more flexible and intuitive querying for applications such as audio editing or assistive technologies. The contribution is incremental in that it builds on existing audio and language processing methods.
The paper tackles the problem of separating a target audio source from a mixture using a natural language query, such as "a man tells a joke followed by people laughing". It proposes LASS-Net, which achieves considerable improvements over baseline methods and shows promising generalization when queried with human-annotated descriptions.
In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query of the target source (e.g., "a man tells a joke followed by people laughing"). A unique challenge in LASS is associated with the complexity of natural language descriptions and their relation to the audio sources. To address this issue, we propose LASS-Net, an end-to-end neural network that learns to jointly process acoustic and linguistic information and to separate, from an audio mixture, the target source that is consistent with the language query. We evaluate the performance of our proposed system on a dataset created from the AudioCaps dataset. Experimental results show that LASS-Net achieves considerable improvements over baseline methods. Furthermore, we observe that LASS-Net achieves promising generalization results when using diverse human-annotated descriptions as queries, indicating its potential use in real-world scenarios. The separated audio samples and source code are available at https://liuxubo717.github.io/LASS-demopage.
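
To make the idea of query-conditioned separation concrete, the following is a minimal sketch in Python with NumPy. It is not the LASS-Net architecture, which the abstract does not specify. It only illustrates the general pattern of encoding a text query into a vector and using that vector to produce a mask applied to the mixture spectrogram. The names embed_query and separate, the word-hashing "encoder", and the random weights are all hypothetical stand-ins. In a real system both the language encoder and the separator would be learned.

```python
import numpy as np


def embed_query(query: str, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for a learned language encoder.

    Hashes each word into one of `dim` buckets and L2-normalizes the counts.
    """
    vec = np.zeros(dim)
    for word in query.lower().split():
        # Deterministic bucket index (Python's built-in hash() is salted per run).
        vec[sum(ord(c) for c in word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def separate(mixture_spec: np.ndarray, query_emb: np.ndarray, weights: np.ndarray):
    """Apply a query-conditioned mask to a magnitude spectrogram.

    mixture_spec: (freq, time) non-negative magnitudes of the mixture.
    query_emb:    (dim,) query embedding.
    weights:      (freq, dim) projection; random here, learned in practice.
    Returns the masked spectrogram and the per-frequency mask.
    """
    logits = weights @ query_emb            # one value per frequency bin
    mask = 1.0 / (1.0 + np.exp(-logits))    # sigmoid keeps the mask in (0, 1)
    return mixture_spec * mask[:, None], mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mixture = np.abs(rng.standard_normal((16, 10)))   # toy "spectrogram"
    weights = rng.standard_normal((16, 8))            # untrained, illustrative only
    query = embed_query("a man tells a joke followed by people laughing")
    estimate, mask = separate(mixture, query, weights)
    print(estimate.shape, mask.shape)
```

Because the weights here are untrained, the output carries no meaningful separation. The sketch only shows how a language query can steer which parts of the mixture are kept.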