ASCLOct 11, 2023

Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction

arXiv:2310.07284v422 citationsh-index: 20
Originality Highly original
AI Analysis

This provides a privacy-conscious and flexible solution for isolating speakers in audio applications, addressing limitations of traditional voiceprint-based methods.

The paper tackles the problem of target speaker extraction in noisy environments by introducing a text-guided approach that uses typed descriptions instead of voiceprints, achieving competitive performance with text cues alone and setting a new state-of-the-art when combined with pre-registered cues.

Humans can easily isolate a single speaker from a complex acoustic environment, a capability referred to as the "Cocktail Party Effect." However, replicating this ability has been a significant challenge in the field of target speaker extraction (TSE). Traditional TSE approaches predominantly rely on voiceprints, which raise privacy concerns and face issues related to the quality and availability of enrollment samples, as well as intra-speaker variability. To address these issues, this work introduces a novel text-guided TSE paradigm named LLM-TSE. In this paradigm, a state-of-the-art large language model, LLaMA 2, processes typed text input from users to extract semantic cues. We demonstrate that textual descriptions alone can effectively serve as cues for extraction, thus addressing privacy concerns and reducing dependency on voiceprints. Furthermore, our approach offers flexibility by allowing the user to specify the extraction or suppression of a speaker and enhances robustness against intra-speaker variability by incorporating context-dependent textual information. Experimental results show competitive performance with text-based cues alone and demonstrate the effectiveness of using text as a task selector. Additionally, they achieve a new state-of-the-art when combining text-based cues with pre-registered cues. This work represents the first integration of LLMs with TSE, potentially establishing a new benchmark in solving the cocktail party problem and expanding the scope of TSE applications by providing a versatile, privacy-conscious solution.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes