Inference and Denoise: Causal Inference-based Neural Speech Enhancement
This work addresses speech enhancement for audio processing applications; its contribution is incremental, as it builds on existing causal inference and mask-based methods.
The study tackles speech enhancement by modeling the presence of noise as an intervention within a causal inference framework. The resulting method outperformed a non-causal mask-based baseline and showed better performance and efficiency than more complex models.
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the presence of noise as an intervention. Based on the potential outcome framework, the proposed causal inference-based speech enhancement (CISE) method uses a noise detector to separate clean and noisy frames in intervened noisy speech, then assigns the two sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE. Specifically, we use the presence of noise as guidance for EM selection during training, and the noise detector selects the EM for each frame according to its predicted presence of noise. Moreover, we derive an SE-specific average treatment effect (ATE) to adequately quantify the causal effect. Experimental results show that CISE outperforms a non-causal mask-based SE approach in the studied settings and achieves better performance and efficiency than more complex SE models.
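The frame-routing idea described above can be illustrated with a minimal NumPy sketch. It assumes a magnitude spectrogram of shape (frames, frequency bins), a per-frame noise flag, and two EMs represented as plain callables that return masks in [0, 1]. In CISE the detector and EMs are learned networks; the helper names `select_routing` and `noise_conditional_enhance`, the 0.5 threshold, and the toy EMs in the usage example are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np


def select_routing(noise_prob=None, noise_label=None, threshold=0.5):
    """Return a (frames,) boolean array: True = route frame to the noisy-frame EM.

    Hypothetical helper. During training, the known noise-presence label guides
    EM selection; at test time, the noise detector's per-frame probability is
    thresholded (the 0.5 default is an assumption).
    """
    if noise_label is not None:
        return np.asarray(noise_label).astype(bool)
    if noise_prob is None:
        raise ValueError("provide noise_label (training) or noise_prob (inference)")
    return np.asarray(noise_prob) >= threshold


def noise_conditional_enhance(noisy_mag, noise_present, em_clean, em_noisy):
    """Apply one of two mask-based EMs to each frame of a magnitude spectrogram.

    noisy_mag:     (frames, freq_bins) input magnitude spectrogram.
    noise_present: (frames,) boolean routing flags.
    em_clean, em_noisy: hypothetical EMs mapping (n, freq_bins) -> masks in [0, 1].
    """
    noisy_mag = np.asarray(noisy_mag, dtype=float)
    noise_present = np.asarray(noise_present, dtype=bool)
    if noise_present.shape != (noisy_mag.shape[0],):
        raise ValueError("need exactly one routing flag per frame")

    mask = np.empty_like(noisy_mag)
    # Each subset of frames is processed only by its own EM.
    if noise_present.any():
        mask[noise_present] = em_noisy(noisy_mag[noise_present])
    if (~noise_present).any():
        mask[~noise_present] = em_clean(noisy_mag[~noise_present])
    # Mask-based SE: the estimate is the mask applied elementwise to the input.
    return mask * noisy_mag


if __name__ == "__main__":
    spec = np.array([[1.0, 2.0], [4.0, 4.0], [3.0, 1.0]])
    flags = select_routing(noise_prob=np.array([0.1, 0.9, 0.7]))
    enhanced = noise_conditional_enhance(
        spec,
        flags,
        em_clean=lambda x: np.ones_like(x),       # toy EM: pass clean frames through
        em_noisy=lambda x: np.full_like(x, 0.5),  # toy EM: constant attenuation
    )
    print(flags)     # [False  True  True]
    print(enhanced)  # rows: [1. 2.], [2. 2.], [1.5 0.5]
```

Routing with a boolean index means each EM only ever sees the frames it is responsible for, which is what lets the two modules specialize. Swapping `noise_label` for `noise_prob` is the only change between the training-time and test-time paths in this sketch.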