Deep Learning Based Phase Reconstruction for Speaker Separation: A Trigonometric Perspective
This work addresses phase reconstruction for speaker separation, a domain-specific problem in audio processing. It is incremental in that it builds on existing magnitude estimation methods.
The study tackles phase reconstruction for speaker separation in the STFT domain. It shows that, given accurate magnitude estimates and a geometric constraint, the absolute phase difference between each source and the mixture is uniquely determined, proposes three algorithms to select the correct phase candidate, and reports state-of-the-art results on the wsj0-2mix and wsj0-3mix datasets.
This study investigates phase reconstruction for deep learning based monaural talker-independent speaker separation in the short-time Fourier transform (STFT) domain. The key observation is that, for a mixture of two sources, with their magnitudes accurately estimated and under a geometric constraint, the absolute phase difference between each source and the mixture can be uniquely determined; in addition, the source phases at each time-frequency (T-F) unit can be narrowed down to only two candidates. To pick the right candidate, we propose three algorithms based on iterative phase reconstruction, group delay estimation, and phase-difference sign prediction. State-of-the-art results are obtained on the publicly available wsj0-2mix and 3mix corpora.
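The geometric constraint can be illustrated with a minimal sketch, assuming it takes the form of the law of cosines applied to the triangle formed by the mixture and the two sources in the complex plane (Y = A + B at each T-F unit). The function names, the `eps` stabilizer, and the toy values below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def abs_phase_difference(mix_mag, src_mag, other_mag, eps=1e-8):
    """Absolute phase difference between one source and the mixture.

    Assumes Y = A + B, so |B|^2 = |Y|^2 + |A|^2 - 2|Y||A|cos(theta_A - theta_Y).
    eps is a hypothetical stabilizer against division by zero.
    """
    cos_delta = (mix_mag**2 + src_mag**2 - other_mag**2) / (2.0 * mix_mag * src_mag + eps)
    # Clip so that imperfect magnitude estimates still yield a valid angle.
    return np.arccos(np.clip(cos_delta, -1.0, 1.0))

def phase_candidates(mix_phase, mix_mag, src_mag, other_mag):
    """Return the two candidate source phases, mixture phase +/- delta."""
    delta = abs_phase_difference(mix_mag, src_mag, other_mag)
    return mix_phase + delta, mix_phase - delta

def wrap(angle):
    """Wrap an angle to the interval (-pi, pi]."""
    return np.angle(np.exp(1j * angle))

# Toy example at a single T-F unit (values chosen arbitrarily).
true_phase_a = 0.3
a = 1.0 * np.exp(1j * true_phase_a)
b = 0.5 * np.exp(1j * -1.1)
y = a + b

cand_plus, cand_minus = phase_candidates(np.angle(y), np.abs(y), np.abs(a), np.abs(b))

# With exact magnitudes, one of the two candidates recovers the true phase.
best_error = min(abs(wrap(cand_plus - true_phase_a)),
                 abs(wrap(cand_minus - true_phase_a)))
print(best_error)
```

The sketch only narrows the phase down to two candidates; choosing the sign of the phase difference is the part that the three proposed algorithms address.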