SD LG AS Apr 25, 2019

Divide and Conquer: A Deep CASA Approach to Talker-independent Monaural Speaker Separation

arXiv:1904.11148v1 25.7171 citations

Originality: Incremental advance

AI Analysis

This work addresses speaker separation for audio processing applications, presenting an incremental improvement by combining deep learning with CASA principles.

The paper tackles talker-independent monaural speaker separation by decomposing it into simultaneous and sequential grouping stages using a deep CASA approach, achieving state-of-the-art results on the WSJ0-2mix database with a modest model size.

We address talker-independent monaural speaker separation from the perspectives of deep learning and computational auditory scene analysis (CASA). Specifically, we decompose the multi-speaker separation task into two stages: simultaneous grouping and sequential grouping. Simultaneous grouping is first performed in each time frame by separating the spectra of different speakers with a permutation-invariantly trained neural network. In the second stage, the frame-level separated spectra are sequentially grouped to different speakers by a clustering network. The proposed deep CASA approach optimizes frame-level separation and speaker tracking in turn, and produces excellent results for both objectives. Experimental results on the benchmark WSJ0-2mix database show that the new approach achieves state-of-the-art results with a modest model size.
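The two stages described in the abstract can be illustrated with a small NumPy sketch for the two-speaker case. This is not the authors' implementation: the function names `frame_level_pit` and `regroup`, the squared-error criterion, and the `(2, T, F)` array layout are assumptions made for illustration. In particular, the per-frame assignment is computed here from reference spectra, standing in for the output of the clustering network used for sequential grouping.

```python
import numpy as np

def frame_level_pit(est, ref):
    """Frame-level permutation-invariant loss for two speakers.

    est, ref: arrays of shape (2, T, F) holding two output streams and
    two reference spectra (hypothetical layout, assumed for this sketch).
    Returns (loss, swap), where swap[t] is True when the swapped
    output-to-speaker pairing has the lower error in frame t.
    """
    # Squared error per frame for each of the two possible pairings.
    straight = ((est[0] - ref[0]) ** 2 + (est[1] - ref[1]) ** 2).sum(axis=-1)
    swapped = ((est[0] - ref[1]) ** 2 + (est[1] - ref[0]) ** 2).sum(axis=-1)
    swap = swapped < straight
    # The loss keeps the better pairing independently in every frame.
    loss = np.minimum(straight, swapped).mean()
    return loss, swap

def regroup(est, swap):
    """Sequential grouping given a per-frame assignment.

    Reorders the frames flagged in swap so that each output stream
    follows a single speaker over time.
    """
    out = est.copy()
    out[0, swap] = est[1, swap]
    out[1, swap] = est[0, swap]
    return out
```

Because the pairing is chosen per frame, the raw outputs may switch speakers from one frame to the next; `regroup` shows why a second stage that assigns frames to speakers is needed to obtain consistent streams.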
