Multi-stage Speaker Extraction with Utterance and Frame-Level Reference Signals
This work addresses the practical challenge of speaker extraction with short reference speech samples, which is significant for real-world applications where long enrollment speeches are impractical.
This paper proposes a multi-stage speaker extraction technique that leverages short reference speech samples by using the extracted speech from early stages as references for later stages. It introduces frame-level sequential speech embeddings as a novel reference signal and a signal fusion scheme to combine multi-scale decoded signals. Experiments on WSJ0-2mix, WHAM!, and WHAMR! datasets demonstrate that the proposed SpEx++ consistently outperforms existing state-of-the-art baselines.
Speaker extraction requires a speech sample from the target speaker as the reference. However, enrolling a speaker with a long speech sample is often impractical. We propose a speaker extraction technique that operates in multiple stages to take full advantage of a short reference speech sample. The speech extracted in early stages is used as the reference speech for later stages. For the first time, we use a frame-level sequential speech embedding as the reference for the target speaker, a departure from the traditional utterance-based speaker embedding reference. In addition, a signal fusion scheme is proposed to combine the decoded signals at multiple scales with automatically learned weights. Experiments on WSJ0-2mix and its noisy versions (WHAM! and WHAMR!) show that the proposed SpEx++ consistently outperforms other state-of-the-art baselines.
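The control flow described above can be illustrated with a minimal sketch. It is not the SpEx++ implementation: the names `multi_stage_extract`, `fuse_multiscale`, and `toy_extract` are hypothetical, the extractor is a stand-in for a trained network, and the softmax normalization of the fusion weights is an assumption (the text only states that the weights are learned automatically).

```python
import math


def softmax(logits):
    """Normalize raw scores into positive weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_multiscale(signals, logits):
    """Weighted sum of equal-length decoded signals, one per scale.

    `logits` stands in for learned parameters; softmax is an assumption.
    """
    weights = softmax(logits)
    length = len(signals[0])
    return [
        sum(w * s[t] for w, s in zip(weights, signals))
        for t in range(length)
    ]


def multi_stage_extract(mixture, reference, extract_fn, num_stages=2):
    """Run the extractor repeatedly, feeding each stage's output
    back in as the reference for the next stage."""
    outputs = []
    ref = reference
    for _ in range(num_stages):
        estimate = extract_fn(mixture, ref)
        outputs.append(estimate)
        ref = estimate  # extracted speech becomes the next reference
    return outputs


def toy_extract(mixture, reference):
    """Placeholder extractor (hypothetical): scales the mixture and
    adds the mean of the reference. A real system would use a network."""
    bias = sum(reference) / len(reference)
    return [0.5 * x + bias for x in mixture]


stage_outputs = multi_stage_extract([1.0, 1.0, 1.0, 1.0], [0.0, 0.0], toy_extract)
fused = fuse_multiscale([[1.0, 1.0], [3.0, 3.0]], [0.0, 0.0])
```

In this sketch the short initial reference is used only in the first stage; every later stage conditions on the previous estimate, which has the same length as the mixture. Whether the original system replaces the reference or combines it with the enrolled sample is not stated in this text.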