SD · Mar 11

Training-Free Multi-Step Inference for Target Speaker Extraction

arXiv:2603.10921v111.2 · h-index: 19
Predicted impact: top 45% in SD · last 90 days · Originality: Incremental advance
AI Analysis

This work addresses speech separation, specifically target speaker extraction (TSE), for applications such as hearing aids and voice assistants. It is rated incremental because it builds on existing TSE systems and contributes a new inference approach rather than a new model.

The paper proposes a training-free multi-step inference method that iteratively refines speech estimates using a frozen pretrained model. When ground-truth target speech is available, optimizing SI-SDRi yields consistent gains across multiple evaluation metrics. Without ground truth, joint optimization of non-intrusive metrics enables controllable extraction preferences.

Target speaker extraction (TSE) aims to recover a target speaker's speech from a mixture using a reference utterance as a cue. Most TSE systems adopt conditional auto-encoder architectures with one-step inference. Inspired by test-time scaling, we propose a training-free multi-step inference method that enables iterative refinement with a frozen pretrained model. At each step, new candidates are generated by interpolating the original mixture and the previous estimate, and the best candidate is selected for further refinement until convergence. Experiments show that, when ground-truth target speech is available, optimizing an intrusive metric (SI-SDRi) yields consistent gains across multiple evaluation metrics. Without ground truth, optimizing non-intrusive metrics (UTMOS or SpkSim) improves the corresponding metric but may hurt others. We therefore introduce joint metric optimization to balance these objectives, enabling controllable extraction preferences for practical deployment.
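The refinement loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The interpolation weights `alphas`, the stopping tolerance `tol`, the weighted-sum form of `joint_score`, and the toy `toy_model` are all assumptions introduced here. Only the overall pattern (interpolate mixture and previous estimate, run the frozen model, keep the best candidate, repeat until no improvement) comes from the text.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB (standard definition, zero-mean signals)."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target          # part of the estimate aligned with the target
    noise = estimate - projection        # everything else counts as distortion
    return 10.0 * np.log10((np.dot(projection, projection) + eps)
                           / (np.dot(noise, noise) + eps))


def joint_score(score_fns, weights):
    """Assumed form of joint metric optimization: a weighted sum of metrics."""
    def score(x):
        return sum(w * f(x) for f, w in zip(score_fns, weights))
    return score


def multi_step_extract(model, mixture, reference, score_fn,
                       alphas=(0.0, 0.25, 0.5, 0.75), max_steps=10, tol=1e-3):
    """Iteratively refine a TSE estimate with a frozen model (no training).

    `model(signal, reference)` is any callable returning an estimate;
    `score_fn(estimate)` returns a scalar where higher is better.
    """
    estimate = model(mixture, reference)             # ordinary one-step inference
    best = score_fn(estimate)
    for _ in range(max_steps):
        # Candidates: run the frozen model on blends of mixture and last estimate.
        candidates = [model(a * mixture + (1.0 - a) * estimate, reference)
                      for a in alphas]
        scores = [score_fn(c) for c in candidates]
        i = int(np.argmax(scores))
        if scores[i] <= best + tol:                  # no meaningful gain: stop
            break
        estimate, best = candidates[i], scores[i]
    return estimate, best


# Toy demonstration with a hypothetical stand-in for a pretrained extractor.
rng = np.random.default_rng(0)
target = rng.standard_normal(1600)
mixture = target + rng.standard_normal(1600)


def toy_model(signal, reference):
    # Halves the component of `signal` that is orthogonal to `reference`.
    proj = np.dot(signal, reference) / np.dot(reference, reference) * reference
    return proj + 0.5 * (signal - proj)


# With ground truth: maximizing SI-SDR is equivalent to maximizing SI-SDRi,
# because the mixture's SI-SDR is a constant offset.
intrusive = lambda est: si_sdr(est, target)
baseline_score = intrusive(toy_model(mixture, target))
final, final_score = multi_step_extract(toy_model, mixture, target, intrusive)
```

In a setting without ground truth, `score_fn` would instead wrap non-intrusive metrics such as UTMOS or SpkSim, for example `joint_score([utmos_fn, spksim_fn], [w1, w2])`, where `utmos_fn`, `spksim_fn`, and the weights are placeholders for whatever predictors and preferences a deployment uses.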

Code Implementations: 1 repo