Investigating Cross-Domain Losses for Speech Enhancement
This work addresses speech enhancement for applications that require improved intelligibility and quality. The contribution appears incremental: it builds on existing time-domain and time-frequency representations and does not present a major breakthrough.
The paper addresses speech enhancement by examining time-domain and time-frequency representations separately for their effect on intelligibility and quality. It then introduces two new cross-domain frameworks that combine the benefits of both, and supports them with a quantitative comparative analysis against recent methods.
Recent years have seen a surge in the number of available frameworks for speech enhancement (SE) and recognition. Whether model-based or constructed via deep learning, these frameworks often rely in isolation on either time-domain signals or time-frequency (TF) representations of speech data. In this study, we investigate the advantages of each set of approaches by separately examining their impact on speech intelligibility and quality. Furthermore, we combine the fragmented benefits of time-domain and TF speech representations by introducing two new cross-domain SE frameworks. A quantitative comparative analysis against recent model-based and deep learning SE approaches is performed to illustrate the merit of the proposed frameworks.
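The abstract does not state what form the cross-domain objective takes. As an illustration only, the sketch below assumes one common way to combine the two domains: a weighted sum of a time-domain L1 error on the waveform and an L1 error on short-time Fourier transform (STFT) magnitudes. The function names `stft_magnitude` and `cross_domain_loss`, the weight `alpha`, and the frame settings `frame_len` and `hop` are hypothetical choices made for this example and are not taken from the paper.

```python
import numpy as np


def stft_magnitude(x, frame_len=256, hop=128):
    """Magnitude spectrogram from a Hann-windowed STFT (assumed settings)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Slice the waveform into overlapping frames and apply the window.
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # One-sided FFT per frame; keep only the magnitude.
    return np.abs(np.fft.rfft(frames, axis=1))


def cross_domain_loss(enhanced, clean, alpha=0.5):
    """Hypothetical cross-domain loss: weighted time-domain and TF L1 terms."""
    # Time-domain term: mean absolute error on the raw waveform.
    time_loss = np.mean(np.abs(enhanced - clean))
    # TF term: mean absolute error between STFT magnitudes.
    tf_loss = np.mean(np.abs(stft_magnitude(enhanced) - stft_magnitude(clean)))
    return alpha * time_loss + (1.0 - alpha) * tf_loss


# Usage: compare a noisy sine tone against its clean version.
rng = np.random.default_rng(0)
t = np.arange(4000) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)
noisy = clean + 0.1 * rng.standard_normal(clean.shape)
print(cross_domain_loss(noisy, clean))
```

Setting `alpha` to 1 reduces this sketch to a purely time-domain loss and setting it to 0 to a purely TF loss, which mirrors the separate examination of each domain described in the abstract.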