SD AI Mar 13

Mask2Flow-TSE: Two-Stage Target Speaker Extraction with Masking and Flow Matching

arXiv:2603.12837 67.7
AI Analysis

This paper addresses target speaker extraction, isolating one speaker's voice from a speech mixture for applications such as speech enhancement. It offers a hybrid approach that balances inference speed and output quality, though the contribution is incremental because it builds on existing paradigms.

The paper tackles target speaker extraction from overlapping speech by proposing Mask2Flow-TSE, a two-stage framework that combines discriminative masking and flow matching to produce high-quality extracted speech in a single inference step. In experiments, the model, at about 85M parameters, performs comparably to existing generative methods.

Target speaker extraction (TSE) extracts the target speaker's voice from overlapping speech mixtures given a reference utterance. Existing approaches typically fall into two categories: discriminative and generative. Discriminative methods apply time-frequency masking for fast inference but often over-suppress the target signal, while generative methods synthesize high-quality speech at the cost of numerous iterative steps. We propose Mask2Flow-TSE, a two-stage framework combining the strengths of both paradigms. The first stage applies discriminative masking for coarse separation, and the second stage employs flow matching to refine the output toward target speech. Unlike generative approaches that synthesize speech from Gaussian noise, our method starts from the masked spectrogram, enabling high-quality reconstruction in a single inference step. Experiments show that Mask2Flow-TSE achieves comparable performance to existing generative TSE methods with approximately 85M parameters.
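The two stages described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the toy spectrograms, the over-suppressing mask, and the oracle velocity field are all hypothetical stand-ins, and the straight-line flow path is an assumption. Speaker conditioning on the reference utterance is omitted entirely. The sketch only shows why starting the flow from the masked spectrogram, rather than from Gaussian noise, makes a single Euler step plausible.

```python
import numpy as np

def apply_mask(mixture_spec, mask):
    """Stage 1 (sketch): element-wise time-frequency masking for coarse separation."""
    return mixture_spec * np.clip(mask, 0.0, 1.0)

def one_step_flow_refine(start_spec, velocity_fn):
    """Stage 2 (sketch): one Euler step of the flow ODE from t=0 to t=1.

    The starting point is the masked spectrogram, not Gaussian noise.
    """
    t0 = 0.0
    return start_spec + (1.0 - t0) * velocity_fn(start_spec, t0)

# Toy magnitude "spectrograms" (hypothetical data, shape: freq bins x frames).
rng = np.random.default_rng(0)
target = rng.random((4, 6))
interferer = rng.random((4, 6))
mixture = target + interferer

# Hypothetical mask that over-suppresses the target by 20%,
# mimicking the failure mode attributed to discriminative methods.
mask = 0.8 * target / (mixture + 1e-8)
coarse = apply_mask(mixture, mask)

def oracle_velocity(x, t):
    """Stand-in for a learned velocity network (assumption).

    For a straight-line path x_t = (1 - t) * x_0 + t * x_1, the velocity
    is the constant x_1 - x_0; here x_1 is the known toy target.
    """
    return target - x

refined = one_step_flow_refine(coarse, oracle_velocity)

coarse_err = np.abs(coarse - target).mean()
refined_err = np.abs(refined - target).mean()
print(f"mean abs error after masking:    {coarse_err:.4f}")
print(f"mean abs error after refinement: {refined_err:.4f}")
```

In a real system the velocity field would be a trained network that never sees the clean target at inference time, so the refinement would be approximate rather than exact as in this oracle example.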
