SD LG MM AS
Sep 2, 2024

Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement

arXiv:2409.01352v14.91 citations
h-index: 10
Has Code

Originality: Incremental advance

AI Analysis

This work addresses speaker extraction for audio processing applications, offering a significant but incremental advance over existing methods.

The paper tackles target speaker extraction from multi-speaker audio by proposing a transformer-based model with speaker embedding consistency and waveform invertibility objectives, achieving an average 4.1 dB improvement over recent state-of-the-art methods.

Recently, attention-based transformers have become a de facto standard in many deep learning applications, including natural language processing, computer vision, and signal processing. In this paper, we propose a transformer-based end-to-end model to extract a target speaker's speech from a monaural multi-speaker mixed audio signal. Unlike existing speaker extraction methods, we introduce two additional objectives that impose speaker embedding consistency and waveform encoder invertibility, and we jointly train the speaker encoder and the speech separator to better capture the speaker conditional embedding. Furthermore, we leverage a multi-scale discriminator to refine the perceptual quality of the extracted speech. Our experiments show that using a dual-path transformer in the separator backbone, along with the proposed training paradigm, improves on the CNN baseline by $3.12$ dB. Finally, we compare our approach with recent state-of-the-art methods and show that our model outperforms them by $4.1$ dB on average without creating additional data dependency.
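
The two auxiliary objectives named in the abstract can be illustrated with a minimal sketch. The abstract does not give their exact form, so everything here is an assumption: the consistency term is taken to be a cosine distance between the reference speaker embedding and the embedding of the extracted speech, the invertibility term is taken to be a mean squared reconstruction error through the waveform encoder and a decoder, and the function names and weights (`w_cons`, `w_inv`) are hypothetical. The adversarial refinement with the multi-scale discriminator is not covered.

```python
import math


def embedding_consistency_loss(ref_emb, est_emb, eps=1e-8):
    """Cosine distance between two speaker embeddings (assumed form of the objective)."""
    dot = sum(r * e for r, e in zip(ref_emb, est_emb))
    ref_norm = math.sqrt(sum(r * r for r in ref_emb))
    est_norm = math.sqrt(sum(e * e for e in est_emb))
    return 1.0 - dot / (ref_norm * est_norm + eps)


def invertibility_loss(waveform, encode, decode):
    """Mean squared error between a waveform and decode(encode(waveform)) (assumed form)."""
    reconstructed = decode(encode(waveform))
    return sum((r - x) ** 2 for r, x in zip(reconstructed, waveform)) / len(waveform)


def total_loss(extraction, consistency, invertibility, w_cons=1.0, w_inv=1.0):
    """Weighted sum of the main extraction loss and the two auxiliary terms.

    The weights are hypothetical; the abstract does not state how the terms are combined.
    """
    return extraction + w_cons * consistency + w_inv * invertibility
```

In this sketch, identical embeddings give a consistency loss near zero, and an encoder/decoder pair that reconstructs its input exactly gives an invertibility loss of zero, so both terms penalise only the failure modes the abstract names.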
