ASSD
Sep 16, 2025

Quality Assessment of Noisy and Enhanced Speech with Limited Data: UWB-NTIS System for VoiceMOS 2024

arXiv:2506.00506
h-index: 9
Originality: Incremental advance
AI Analysis

For researchers and engineers needing speech quality assessment under extreme data scarcity, this work shows that transfer learning with synthetic data can yield competitive results.

The authors developed a non-intrusive speech quality prediction system for noisy/enhanced speech using wav2vec 2.0 with two-stage transfer learning, achieving best BAK prediction (LCC=0.867) and second-best OVRL (LCC=0.711) in VoiceMOS 2024 Track 3, despite only 100 labeled training samples.
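The LCC figures quoted above are linear correlation coefficients (Pearson correlation) between predicted and subjective scores. As a minimal sketch of how such a value is computed, the following uses only the Python standard library. The function name `lcc` and the sample scores are illustrative assumptions, not taken from the paper or the challenge tooling.

```python
import math


def lcc(predicted, target):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    if len(predicted) != len(target) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    # Covariance and variances (unnormalised; the 1/n factors cancel).
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_t)


# Hypothetical scores for illustration only (not data from the paper).
predicted_bak = [2.1, 3.4, 4.0, 1.8, 3.0]
subjective_bak = [2.0, 3.6, 4.2, 1.5, 2.7]
print(round(lcc(predicted_bak, subjective_bak), 3))
```

A value of 1.0 means the predictions are a perfect increasing linear function of the subjective scores. LCC does not penalise a constant offset or scale error, so it is usually read alongside an error measure.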

We present a system for non-intrusive prediction of speech quality in noisy and enhanced speech, developed for Track 3 of the VoiceMOS 2024 Challenge. The task required estimating the ITU-T P.835 metrics SIG, BAK, and OVRL without reference signals and with only 100 subjectively labeled utterances for training. Our approach uses wav2vec 2.0 with a two-stage transfer learning strategy: initial fine-tuning on automatically labeled noisy data, followed by adaptation to the challenge data. The system achieved the best performance on BAK prediction (LCC=0.867) and a very close second place in OVRL (LCC=0.711) in the official evaluation. Post-challenge experiments show that adding artificially degraded data to the first fine-tuning stage substantially improves SIG prediction, raising correlation with ground truth scores from 0.207 to 0.516. These results demonstrate that transfer learning with targeted data generation is effective for predicting P.835 scores under severe data constraints.
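The two-stage strategy described in the abstract can be illustrated with a deliberately tiny toy, assuming a one-dimensional linear model in place of wav2vec 2.0 and synthetic numbers in place of speech data. Everything in the sketch (the data generators, learning rates, step counts, and the `fine_tune` helper) is hypothetical. It shows only the order of operations: train on plentiful automatically labeled data first, then continue from those weights on a small human-labeled set.

```python
import random


def mse(w, b, data):
    """Mean squared error of the linear model y = w * x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def fine_tune(w, b, data, lr, steps):
    """Full-batch gradient descent on MSE, starting from the given weights."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


random.seed(0)

# Stage-1 data (assumption): many samples with cheap, imperfect automatic labels.
auto_labeled = []
for _ in range(1000):
    x = random.random()
    auto_labeled.append((x, 3.0 * x + 1.5 + random.gauss(0, 0.3)))

# Stage-2 data (assumption): only 100 samples with accurate "subjective" labels
# that follow a related but slightly different mapping.
human_labeled = []
for _ in range(100):
    x = random.random()
    human_labeled.append((x, 3.2 * x + 1.0 + random.gauss(0, 0.1)))

# Stage 1: initial fine-tuning on the automatically labeled data.
w, b = fine_tune(0.0, 0.0, auto_labeled, lr=0.5, steps=300)
loss_after_stage1 = mse(w, b, human_labeled)

# Stage 2: adaptation to the small labeled set, continuing from stage-1 weights
# with a smaller learning rate.
w, b = fine_tune(w, b, human_labeled, lr=0.1, steps=200)
loss_after_stage2 = mse(w, b, human_labeled)

print(loss_after_stage1, loss_after_stage2)
```

In this toy, the first stage gives the model a reasonable starting point from the cheap labels, and the second stage reduces the remaining mismatch with the small target set. The real system fine-tunes a pretrained speech model rather than a two-parameter regressor.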
