SD, LG, AS · Sep 12, 2024

TSELM: Target Speaker Extraction using Discrete Tokens and Language Models

arXiv:2409.07841v3 · 14 citations · h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses target speaker extraction for audio processing applications. It is an incremental improvement that integrates existing techniques such as WavLM and HiFi-GAN.

The paper tackles target speaker extraction by proposing TSELM, a network that uses discrete tokens and language models to recast audio generation as a classification task. In its experiments, TSELM reports excellent speech quality and comparable speech intelligibility.

We propose TSELM, a novel target speaker extraction network that leverages discrete tokens and language models. TSELM utilizes multiple discretized layers from WavLM as input tokens and incorporates cross-attention mechanisms to integrate target speaker information. Language models are employed to capture the sequence dependencies, while a scalable HiFi-GAN is used to reconstruct the audio from the tokens. By applying a cross-entropy loss, TSELM models the probability distribution of output tokens, thus converting the complex regression problem of audio generation into a classification task. Experimental results show that TSELM achieves excellent results in speech quality and comparable results in speech intelligibility.
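
The abstract's central idea, predicting discrete tokens under a cross-entropy loss instead of regressing audio directly, can be illustrated with a small sketch. This is not the paper's implementation: the nearest-centroid quantizer, the toy `codebook`, the 2-dimensional frames, and the function names `quantize`, `log_softmax`, and `token_cross_entropy` are all hypothetical stand-ins chosen for illustration. In TSELM the tokens come from discretized WavLM layers and the logits from a language model.

```python
import math

def quantize(frame, codebook):
    """Return the index of the codebook entry nearest to `frame`
    (squared Euclidean distance). Assumed discretization scheme."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(frame, codebook[k])),
    )

def log_softmax(logits):
    """Numerically stable log-softmax over one frame's token logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_cross_entropy(logits_per_frame, target_tokens):
    """Mean negative log-likelihood of the target token at each frame."""
    total = 0.0
    for logits, target in zip(logits_per_frame, target_tokens):
        total -= log_softmax(logits)[target]
    return total / len(target_tokens)

# Hypothetical codebook with 3 entries over 2-dimensional features.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]

# Continuous "clean speech" features become discrete classification targets.
clean_frames = [[0.1, -0.1], [0.9, 1.2], [-1.1, 0.8]]
targets = [quantize(frame, codebook) for frame in clean_frames]

# Two sets of model outputs: uninformative logits and confident correct ones.
uniform_logits = [[0.0, 0.0, 0.0] for _ in targets]
confident_logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]

loss_uniform = token_cross_entropy(uniform_logits, targets)
loss_confident = token_cross_entropy(confident_logits, targets)
print(targets, loss_uniform, loss_confident)
```

With uninformative logits the loss equals the log of the codebook size, and it falls as the model places more probability on the correct token. The training signal is therefore a per-frame classification objective over a finite vocabulary, not a distance between waveforms or spectrograms.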

Code Implementations: 1 repo
