SD, AI, Aug 8, 2025

Robust Target Speaker Diarization and Separation via Augmented Speaker Embedding Sampling

arXiv:2508.06393v1, 1 citation, h-index: 6, INTERSPEECH
Originality: Highly original
AI Analysis

This work removes the requirement for prior speaker knowledge in speech separation and diarization, offering a more flexible solution for applications such as meeting transcription or surveillance.

The paper tackles the problem of enrollment-free speech separation and speaker diarization by introducing a dual-stage training pipeline with an overlapping spectral loss function, achieving a 71% relative improvement in DER and 69% in cpWER compared to SOTA baselines.

Traditional speech separation and speaker diarization approaches rely on prior knowledge of target speakers or a predetermined number of participants in audio signals. To address these limitations, recent advances focus on developing enrollment-free methods capable of identifying targets without explicit speaker labeling. This work introduces a new approach to jointly train speech separation and diarization using automatic identification of target speaker embeddings within mixtures. Our proposed model employs a dual-stage training pipeline designed to learn robust speaker representation features that are resilient to background noise interference. Furthermore, we present an overlapping spectral loss function specifically tailored for enhancing diarization accuracy during overlapped speech frames. Experimental results show significant performance gains compared to the current SOTA baseline, achieving 71% relative improvement in DER and 69% in cpWER.
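
To make the "relative improvement" figures concrete, the sketch below shows how such a percentage is conventionally computed for an error metric where lower is better (DER, cpWER). The function name and the input values are hypothetical and chosen only for illustration; they are not results reported in the paper.

```python
def relative_improvement(baseline: float, proposed: float) -> float:
    """Relative reduction of an error metric (lower is better), as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline - proposed) / baseline


# Hypothetical error rates in percent, NOT taken from the paper.
baseline_der = 20.0
proposed_der = 5.8

print(f"Relative DER improvement: {relative_improvement(baseline_der, proposed_der):.0%}")
```

Under this convention, a 71% relative improvement means the proposed system's error rate is 29% of the baseline's, whatever the absolute values are.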
