SD LG AS | Sep 12, 2024

TSELM: Target Speaker Extraction using Discrete Tokens and Language Models

arXiv:2409.07841v3 | 15.614 citations | h-index: 6 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses target speaker extraction for audio processing applications. It is an incremental improvement that integrates existing components such as WavLM and HiFi-GAN.

The paper tackles target speaker extraction by proposing TSELM, a network that uses discrete tokens and language models to turn audio generation into a classification task. In the reported experiments it achieves excellent speech quality and comparable speech intelligibility.

We propose TSELM, a novel target speaker extraction network that leverages discrete tokens and language models. TSELM utilizes multiple discretized layers from WavLM as input tokens and incorporates cross-attention mechanisms to integrate target speaker information. Language models are employed to capture the sequence dependencies, while a scalable HiFi-GAN is used to reconstruct the audio from the tokens. By applying a cross-entropy loss, TSELM models the probability distribution of output tokens, thus converting the complex regression problem of audio generation into a classification task. Experimental results show that TSELM achieves excellent results in speech quality and comparable results in speech intelligibility.
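The step from regression to classification can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid `quantize` function, the random `codebook`, and all shapes are assumptions chosen for illustration, and plain NumPy arrays stand in for WavLM features and language-model outputs. Continuous frames are mapped to discrete token ids, and a cross-entropy loss then scores predicted per-frame logits against those ids.

```python
import numpy as np

def quantize(features, codebook):
    """Map each frame in (T, D) to the index of its nearest codebook vector in (K, D)."""
    # Squared Euclidean distance between every frame and every codebook entry.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def cross_entropy(logits, targets):
    """Mean cross-entropy between per-frame logits (T, K) and token ids (T,)."""
    # Subtract the row maximum for a numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))        # hypothetical: K=8 tokens, D=4 dimensions
clean_features = rng.normal(size=(5, 4))  # hypothetical: T=5 frames of clean speech
target_tokens = quantize(clean_features, codebook)

# Two stand-ins for model outputs: no preference, and a strong preference
# for the correct token in every frame.
uniform_logits = np.zeros((5, 8))
confident_logits = 10.0 * np.eye(8)[target_tokens]

loss_uniform = cross_entropy(uniform_logits, target_tokens)
loss_confident = cross_entropy(confident_logits, target_tokens)
```

With uniform logits the loss equals the log of the vocabulary size, and it falls toward zero as the predicted distribution concentrates on the correct token. This is the sense in which the training objective becomes a classification problem over token ids rather than a regression on waveform samples.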
