End-to-End Multi-Microphone Speaker Extraction Using Relative Transfer Functions
This work addresses speaker extraction in noisy environments, a problem relevant to applications such as voice assistants and conference calls, and presents an incremental improvement over existing methods.
This paper tackles the extraction of a desired speaker from a mixture of multiple speakers and noise in a reverberant environment. Using the instantaneous relative transfer function (RTF) as the cue yields better performance than either a direction of arrival (DOA)-based spatial cue or a spectral embedding.
This paper introduces a multi-microphone method for extracting a desired speaker from a mixture involving multiple speakers and directional noise in a reverberant environment. We propose leveraging the instantaneous relative transfer function (RTF), estimated from a reference utterance recorded at the same position as the desired source. The effectiveness of the RTF-based spatial cue is compared with that of a direction of arrival (DOA)-based spatial cue and the conventional spectral embedding. Experimental results in challenging acoustic scenarios demonstrate that spatial cues yield better performance than the spectral-based cue and that the instantaneous RTF outperforms the DOA-based spatial cue.
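To make the notion of an instantaneous RTF concrete, the following is a minimal sketch of one common way to compute it: for each time-frequency bin, the STFT of every microphone is divided by the STFT of a reference microphone. The abstract does not specify the estimator or the STFT settings, so the per-bin ratio, the regularizer `eps`, the frame parameters, and the function names `stft` and `instantaneous_rtf` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def stft(x, n_fft=512, hop=256):
    """Minimal STFT (assumed settings): Hann-windowed frames followed by an rFFT.

    Returns a complex array of shape (n_frames, n_fft // 2 + 1).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)


def instantaneous_rtf(mics, ref=0, eps=1e-8):
    """Per-bin ratio of each microphone's STFT to the reference microphone's STFT.

    mics: real array of shape (n_mics, n_samples), e.g. the reference utterance.
    Returns a complex array of shape (n_mics, n_frames, n_bins).
    The eps term is a hypothetical regularizer that avoids division by zero.
    """
    X = np.stack([stft(ch) for ch in mics])  # (n_mics, n_frames, n_bins)
    X_ref = X[ref]                           # (n_frames, n_bins)
    # X / X_ref written as X * conj(X_ref) / |X_ref|^2, with regularization.
    return X * np.conj(X_ref) / (np.abs(X_ref) ** 2 + eps)


# Toy usage: the second microphone is an attenuated copy of the first,
# so its RTF magnitude relative to microphone 0 should be close to 0.5.
rng = np.random.default_rng(0)
source = rng.standard_normal(4096)
mics = np.stack([source, 0.5 * source])
rtf = instantaneous_rtf(mics)
```

In a real reverberant recording the ratio varies across frequency and from frame to frame. How such a cue is fed to the extraction network is not described in this abstract and is left out of the sketch.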