SD | IR | LG | AS | Feb 19, 2024

SECP: A Speech Enhancement-Based Curation Pipeline For Scalable Acquisition Of Clean Speech

arXiv:2402.12482v1 | 4 citations | h-index: 4 | ICASSP
Originality: Incremental advance
AI Analysis

This paper addresses the need for scalable acquisition of clean speech for speech technologies, but the contribution is incremental because it builds on existing speech enhancement methods.

The paper tackles the problem of acquiring clean speech data at scale for training speech enhancement models by proposing SECP, a pipeline that minimizes human annotation and uses iterative refinement. The results show that using enhanced speech as ground truth does not degrade model performance as measured by ΔPESQ, and that the refined data is perceptually better than the original data in subjective tests.

As more speech technologies rely on a supervised deep learning approach with clean speech as the ground truth, a methodology to onboard said speech at scale is needed. However, this approach needs to minimize the dependency on human listening and annotation, requiring a human-in-the-loop only when needed. In this paper, we address this issue by outlining the Speech Enhancement-based Curation Pipeline (SECP), which serves as a framework to onboard clean speech. This clean speech can then train a speech enhancement model, which can further refine the original dataset and thus close the iterative loop. By running two iterative rounds, we observe that enhanced output used as ground truth does not degrade model performance according to $\Delta_{PESQ}$, a metric used in this paper. We also show through comparative mean opinion score (CMOS) based subjective tests that the highest and lowest bounds of refined data are perceptually better than the original data.
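The iterative loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `run_secp` and the callables `is_clean`, `train`, and `enhance` are hypothetical placeholders for the curation check, the model training step, and the enhancement step, none of which are specified in the text above.

```python
from typing import Callable, List, Sequence, TypeVar

Clip = TypeVar("Clip")    # any representation of a speech clip
Model = TypeVar("Model")  # any representation of a trained enhancer


def run_secp(
    clips: Sequence[Clip],
    is_clean: Callable[[Clip], bool],        # hypothetical curation check
    train: Callable[[List[Clip]], Model],    # hypothetical training step
    enhance: Callable[[Model, Clip], Clip],  # hypothetical enhancement step
    rounds: int = 2,                         # the abstract reports two rounds
) -> List[Clip]:
    """Curate clean clips, train an enhancer on them, refine the data, repeat."""
    data = list(clips)
    for _ in range(rounds):
        # Onboard only the clips currently judged clean.
        clean = [clip for clip in data if is_clean(clip)]
        if not clean:
            break  # nothing to train on; stop early
        model = train(clean)
        # The refined dataset becomes the input to the next round.
        data = [enhance(model, clip) for clip in data]
    return data
```

In this sketch the human-in-the-loop step, if any, would sit inside `is_clean`; how the paper actually decides which clips count as clean is not stated in the abstract.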

