Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling
This work addresses the problem of poor perceptual speech quality in real-world scenarios, which matters for applications such as hearing aids and communication systems. However, it appears incremental, since it builds on existing AVSE methods and contributes mainly a new pipeline.
The paper tackles audio-visual speech enhancement in complex acoustic environments with interfering sounds and reverberation. In the AVSEC-4 challenge, the system took first place in the human subjective listening test and achieved excellent results on the objective metrics.
Audio-visual speech enhancement (AVSE) is a task that uses visual auxiliary information to extract a target speaker's speech from mixed audio. Real-world acoustic environments are often complex, with various interfering sounds and reverberation. Most previous methods struggle to cope with such conditions, resulting in poor perceptual quality of the extracted speech. In this paper, we propose an effective AVSE system that performs well in complex acoustic environments. Specifically, we design a "separation before dereverberation" pipeline that can be extended to other AVSE networks. The 4th COGMHEAR Audio-Visual Speech Enhancement Challenge (AVSEC) aims to explore new approaches to speech processing in multimodal complex environments. We validated the performance of our system in AVSEC-4: we achieved excellent results on the three objective metrics on the competition leaderboard and ultimately secured first place in the human subjective listening test.
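The "separation before dereverberation" ordering can be illustrated as a simple two-stage composition. The following is a minimal sketch, not the system described in the paper: the names `make_pipeline`, `toy_separate`, and `toy_dereverberate` are hypothetical, the two toy stages are arithmetic stand-ins for trained networks, and the choice to make the second stage audio-only is an assumption that the abstract does not confirm.

```python
from typing import Callable, List, Sequence

Audio = List[float]          # mono waveform as a list of samples
Visual = Sequence[object]    # placeholder for per-frame visual features

Separator = Callable[[Audio, Visual], Audio]
Dereverberator = Callable[[Audio], Audio]


def make_pipeline(separate: Separator, dereverberate: Dereverberator):
    """Compose any separator with a dereverberation stage, in that order."""
    def enhance(mixture: Audio, visual: Visual) -> Audio:
        # Stage 1: extract the (still reverberant) target speech using visual cues.
        separated = separate(mixture, visual)
        # Stage 2: suppress reverberation on the separated signal.
        # Assumption: this stage is audio-only.
        return dereverberate(separated)
    return enhance


def toy_separate(mixture: Audio, visual: Visual) -> Audio:
    # Hypothetical stand-in for an AVSE network: scale the mixture by 0.5
    # and ignore the visual input.
    return [0.5 * x for x in mixture]


def toy_dereverberate(audio: Audio) -> Audio:
    # Hypothetical stand-in for a dereverberation model: subtract a scaled
    # copy of the previous input sample (a crude single-echo canceller).
    out: Audio = []
    prev = 0.0
    for x in audio:
        out.append(x - 0.5 * prev)
        prev = x
    return out


enhance = make_pipeline(toy_separate, toy_dereverberate)
result = enhance([2.0, 4.0, 6.0], visual=[None, None, None])
```

Because the separator is passed in as a callable, a different AVSE network could in principle be swapped into the first stage without changing the second, which is one way to read the abstract's claim that the pipeline extends to other AVSE networks.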