The PESQetarian: On the Relevance of Goodhart's Law for Speech Enhancement
The paper highlights a critical issue for researchers and practitioners in speech enhancement: it warns against relying solely on instrumental metrics for model evaluation. The contribution is incremental, however, in that it demonstrates a known risk.
The paper tackles the problem of speech enhancement models overfitting to evaluation metrics such as PESQ. It shows that a model optimized for PESQ achieves 3.82 PESQ on VB-DMD yet scores very poorly in a listening experiment, so performance claims based on that metric alone can be misleading.
To obtain improved speech enhancement models, researchers often focus on increasing performance according to specific instrumental metrics. However, when the same metric is used in a loss function to optimize models, it may be detrimental to aspects that the given metric does not see. The goal of this paper is to illustrate the risk of overfitting a speech enhancement model to the metric used for evaluation. For this, we introduce enhancement models that exploit the widely used PESQ measure. Our "PESQetarian" model achieves 3.82 PESQ on VB-DMD while scoring very poorly in a listening experiment. While the obtained PESQ value of 3.82 would imply "state-of-the-art" PESQ-performance on the VB-DMD benchmark, our examples show that when optimizing w.r.t. a metric, an isolated evaluation on the same metric may be misleading. Instead, other metrics should be included in the evaluation and the resulting performance predictions should be confirmed by listening.
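The recommendation to include other metrics in the evaluation can be illustrated with a minimal sketch. It is not the authors' evaluation procedure: the function name `regressed_metrics`, the metric keys, and all scores below are hypothetical placeholders, and the check assumes that every metric is higher-is-better. It simply lists the metrics that got worse while the optimized metric improved, which is the pattern the abstract warns about.

```python
def regressed_metrics(baseline, candidate, optimized_metric):
    """Return names of non-optimized metrics on which `candidate` scores
    lower than `baseline`, given that it gained on `optimized_metric`.

    Assumption: every metric is higher-is-better and both dicts share keys.
    """
    if candidate[optimized_metric] <= baseline[optimized_metric]:
        return []  # no gain on the optimized metric, nothing to cross-check
    return sorted(
        name
        for name in baseline
        if name != optimized_metric and candidate[name] < baseline[name]
    )


# Placeholder scores for illustration only; these are NOT results from the paper.
baseline = {"pesq": 3.0, "si_sdr_db": 18.0, "estoi": 0.85}
candidate = {"pesq": 3.8, "si_sdr_db": 9.0, "estoi": 0.80}

suspicious = regressed_metrics(baseline, candidate, optimized_metric="pesq")
if suspicious:
    print("Gain on the optimized metric is not confirmed by:", ", ".join(suspicious))
```

A non-empty result does not prove that the model is worse. It only signals that the gain on the optimized metric is not corroborated by the other metrics and should be checked by listening.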