Deep Ad-hoc Beamforming Based on Speaker Extraction for Target-Dependent Speech Separation
This work addresses the problem of extracting a specific speaker from mixed speech in large ad-hoc microphone arrays, which is important for applications like speaker tracing.
This paper introduces a deep ad-hoc beamforming approach for target-dependent speech separation, a task that, to the authors' knowledge, has not previously been explored for ad-hoc microphone arrays. Experiments on the WSJ0-adhoc corpus show that the method extracts a target speaker from mixed speech.
Recently, research on ad-hoc microphone arrays with deep learning has drawn much attention, especially in speech enhancement and separation. Because an ad-hoc microphone array may cover such a large area that multiple speakers are located far apart and talk independently, target-dependent speech separation, which aims to extract a target speaker from mixed speech, is important for extracting and tracing a specific speaker in the ad-hoc array. However, this technique has not yet been explored. In this paper, we propose deep ad-hoc beamforming based on speaker extraction, which is, to our knowledge, the first work on target-dependent speech separation based on ad-hoc microphone arrays and deep learning. The algorithm contains three components. First, we propose a supervised channel selection framework based on speaker extraction, in which the estimated utterance-level SNRs of the target speech serve as the basis for channel selection. Second, we apply the selected channels to a deep-learning-based MVDR algorithm, in which a single-channel speaker extraction algorithm is applied to each selected channel to estimate the mask of the target speech. We conducted extensive experiments on the WSJ0-adhoc corpus. Experimental results demonstrate the effectiveness of the proposed method.
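The two steps described above, SNR-based channel selection followed by mask-driven MVDR, can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work. The function names `select_channels` and `mask_based_mvdr` are hypothetical, the top-k selection rule is an assumed stand-in for the channel selection framework, and the beamformer uses the common reference-channel MVDR formulation, w = (Φ_n⁻¹ Φ_s / tr(Φ_n⁻¹ Φ_s)) u_ref, with covariances estimated from masked spectra. The per-channel SNR estimates and masks are assumed to come from a single-channel speaker extraction network, which is not shown.

```python
import numpy as np


def select_channels(est_snr_db, k):
    """Return the indices of the k channels with the highest estimated SNR.

    est_snr_db: per-channel utterance-level SNR estimates of the target speech.
    Top-k selection is an assumption made for this sketch.
    """
    order = np.argsort(np.asarray(est_snr_db))[::-1]  # descending SNR
    return np.sort(order[:k])


def mask_based_mvdr(stft, mask, ref=0, eps=1e-6):
    """Mask-based MVDR beamforming (illustrative formulation).

    stft: complex array, shape (channels, frames, freqs).
    mask: real array in [0, 1], same shape; target-speech mask per channel.
    ref:  index of the reference channel among the given channels.
    Returns the beamformed spectrum, shape (frames, freqs).
    """
    n_ch, n_frames, n_freqs = stft.shape
    out = np.zeros((n_frames, n_freqs), dtype=complex)
    for f in range(n_freqs):
        y = stft[:, :, f]                    # (channels, frames)
        y_s = mask[:, :, f] * y              # estimated target component
        y_n = (1.0 - mask[:, :, f]) * y      # estimated interference + noise
        # Spatial covariance matrices averaged over frames.
        phi_s = y_s @ y_s.conj().T / n_frames
        phi_n = y_n @ y_n.conj().T / n_frames + eps * np.eye(n_ch)
        # phi_n^{-1} phi_s without forming an explicit inverse.
        num = np.linalg.solve(phi_n, phi_s)
        w = num[:, ref] / (np.trace(num) + eps)
        out[:, f] = w.conj() @ y             # w^H y for every frame
    return out
```

In use, one would estimate an SNR for every channel of the ad-hoc array, pass those estimates to `select_channels`, and then call `mask_based_mvdr` on the STFTs and masks of the selected channels only. Restricting the beamformer to high-SNR channels keeps poorly placed microphones from degrading the covariance estimates.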