A Comparison of Generative and Discriminative Methods for Speech Enhancement: Robustness, Complexity, and Hallucination
For researchers and practitioners in speech enhancement, this study provides empirical guidance on the trade-offs between perceptual quality and computational cost.
This study compares generative and discriminative speech enhancement methods, finding that discriminative models offer better robustness and lower complexity, while generative models introduce hallucinations that degrade word error rate.
In this study, we conduct a comprehensive comparative analysis of generative and discriminative deep learning-based speech enhancement methods for noise reduction. We evaluate their effectiveness under high and low signal-to-noise ratio conditions, in both matched and mismatched training scenarios. We further investigate the impact of training data volume and model convergence speed, and we interpret the resulting performance differences in terms of objective results for the considered training paradigms. Additionally, we compare the complexity-performance trade-off and the practical viability of these approaches. To strengthen the evaluation, we study the hallucination characteristics of generative approaches in terms of word error rate and phoneme similarity. The insights derived from this study provide empirical evidence to help researchers and practitioners judge whether the perceptual gains of each approach justify its computational cost in practical applications.
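
Because the hallucination analysis is reported in terms of word error rate (WER), a short illustration of that metric may be useful. The following is a minimal sketch, assuming the standard definition WER = (S + D + I) / N, where S, D, and I are the word substitutions, deletions, and insertions needed to turn the reference transcript into the recognizer output, and N is the number of reference words. The function name `word_error_rate` and the example transcripts are illustrative assumptions and are not taken from the study, which may use a different toolkit or text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumes whitespace tokenization and exact string matching; real
    evaluations usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts: one substituted word and one inserted word,
    # the kind of plausible-sounding error a hallucinating enhancer could cause.
    reference = "turn the lights off"
    hypothesis = "turn the light switch off"
    print(word_error_rate(reference, hypothesis))
```

In this usage, a hallucinated word that sounds natural to a listener still counts as a substitution or insertion, which is why WER can worsen even when perceptual quality appears to improve.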