ASAILGJul 15, 2025

JSQA: Speech Quality Assessment with Perceptually-Inspired Contrastive Pretraining Based on JND Audio Pairs

arXiv:2507.11636v11 citationsh-index: 1WASPAA
Originality Incremental advance
AI Analysis

This work addresses speech quality assessment for applications like audio processing, but it is incremental as it builds on existing contrastive learning and pretraining methods.

The authors tackled speech quality assessment by proposing JSQA, a two-stage framework that uses perceptually-inspired contrastive pretraining on just noticeable difference (JND) audio pairs, followed by fine-tuning for mean opinion score (MOS) prediction, resulting in significant performance improvements compared to training from scratch.

Speech quality assessment (SQA) is often used to learn a mapping from a high-dimensional input space to a scalar that represents the mean opinion score (MOS) of the perceptual speech quality. Learning such a mapping is challenging for many reasons, but largely because MOS exhibits high levels of inherent variance due to perceptual and experimental-design differences. Many solutions have been proposed, but many approaches do not properly incorporate perceptual factors into their learning algorithms (beyond the MOS label), which could lead to unsatisfactory results. To this end, we propose JSQA, a two-stage framework that pretrains an audio encoder using perceptually-guided contrastive learning on just noticeable difference (JND) pairs, followed by fine-tuning for MOS prediction. We first generate pairs of audio data within JND levels, which are then used to pretrain an encoder to leverage perceptual quality similarity information and map it into an embedding space. The JND pairs come from clean LibriSpeech utterances that are mixed with background noise from CHiME-3, at different signal-to-noise ratios (SNRs). The encoder is later fine-tuned with audio samples from the NISQA dataset for MOS prediction. Experimental results suggest that perceptually-inspired contrastive pretraining significantly improves the model performance evaluated by various metrics when compared against the same network trained from scratch without pretraining. These findings suggest that incorporating perceptual factors into pretraining greatly contributes to the improvement in performance for SQA.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes