Christian Schüldt

1.2 AS Feb 16

SA-SSL-MOS: Self-supervised Learning MOS Prediction with Spectral Augmentation for Generalized Multi-Rate Speech Assessment

Fengyuan Cao, Xinyu Liang, Fredrik Cumlin et al.

Designing a speech quality assessment (SQA) system for estimating the mean opinion score (MOS) of multi-rate speech with sampling frequencies ranging from 16 to 48 kHz is a challenging task. The challenge stems from the limited availability of MOS-labeled training data comprising multi-rate speech samples. While self-supervised learning (SSL) models have been widely adopted in SQA to boost performance, a key limitation is that they are pretrained on 16 kHz speech and therefore discard high-frequency information present in speech recorded at higher sampling rates. To address this issue, we propose a spectrogram-augmented SSL method that incorporates high-frequency features (up to 48 kHz sampling rate) through a parallel-branch architecture. We further introduce a two-step training scheme: the model is first pre-trained on a large 48 kHz dataset and then fine-tuned on a smaller multi-rate dataset. Experimental results show that leveraging high-frequency information overlooked by SSL features is crucial for accurate multi-rate SQA, and that the proposed two-step training substantially improves generalization when multi-rate data is limited.
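The limitation the abstract describes can be illustrated with a small calculation: a model fed 16 kHz audio cannot represent content at or above the 8 kHz Nyquist frequency. The following is a minimal sketch, not the paper's method. The function name `band_energy_fraction`, the two-tone test signal, and the naive DFT are all assumptions chosen to keep the example dependency-free.

```python
import cmath
import math


def band_energy_fraction(x, fs, cutoff_hz):
    """Fraction of one-sided spectral energy at or above cutoff_hz (naive DFT)."""
    n = len(x)
    total = 0.0
    high = 0.0
    for k in range(n // 2 + 1):
        # DFT bin k corresponds to frequency k * fs / n.
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        energy = abs(coeff) ** 2
        total += energy
        if k * fs / n >= cutoff_hz:
            high += energy
    return high / total if total > 0 else 0.0


fs = 48000          # hypothetical full-band sampling rate
n = 480             # 10 ms of audio, giving 100 Hz bin spacing
# Illustrative signal: equal-amplitude tones at 1 kHz and 12 kHz.
x = [
    math.sin(2 * math.pi * 1000 * t / fs) + math.sin(2 * math.pi * 12000 * t / fs)
    for t in range(n)
]

# Share of energy that lies above the Nyquist frequency of a 16 kHz model.
frac = band_energy_fraction(x, fs, cutoff_hz=8000)
print(f"energy at or above 8 kHz: {frac:.3f}")
```

For this synthetic signal half of the energy sits in the 12 kHz tone, which is the part a 16 kHz front end would never see. Real speech has a very different spectral balance, so the number here is only illustrative.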

4.8 SD Oct 1, 2015

Noise robust integration for blind and non-blind reverberation time estimation

Christian Schüldt, Peter Händel

The estimation of the decay rate of a signal section is an integral component of both blind and non-blind reverberation time estimation methods. Several decay rate estimators have previously been proposed, based on, e.g., linear regression and maximum-likelihood estimation. Unfortunately, most approaches are sensitive to background noise, and/or are fairly demanding in terms of computational complexity. This paper presents a low complexity decay rate estimator, robust to stationary noise, for reverberation time estimation. Simulations using artificial signals, and experiments with speech in ventilation noise, demonstrate the performance and noise robustness of the proposed method.
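As a point of reference for the linear-regression family of estimators the abstract mentions, the sketch below fits a straight line to frame-wise log energy and converts the slope to a reverberation time. It is not the low-complexity estimator proposed in the paper. The function name `decay_rate_db_per_s`, the frame length, and the noise-free synthetic decay are assumptions made for illustration.

```python
import math


def decay_rate_db_per_s(x, fs, frame_len):
    """Least-squares slope, in dB per second, of the frame-wise log energy."""
    times, levels = [], []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[start:start + frame_len]
        energy = sum(v * v for v in frame) / frame_len
        if energy <= 0.0:
            continue  # skip silent frames, whose log energy is undefined
        times.append((start + frame_len / 2) / fs)
        levels.append(10.0 * math.log10(energy))
    n = len(times)
    mean_t = sum(times) / n
    mean_l = sum(levels) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (l - mean_l) for t, l in zip(times, levels))
    return sxy / sxx


fs = 8000            # hypothetical sampling rate
rt60_true = 0.5      # seconds for the energy to fall by 60 dB
# Noise-free synthetic decay: a 1 kHz tone whose amplitude drops by a
# factor of 1000 (60 dB) over rt60_true seconds.
x = [
    10.0 ** (-3.0 * t / (fs * rt60_true)) * math.sin(2 * math.pi * 1000 * t / fs)
    for t in range(int(0.4 * fs))
]

slope = decay_rate_db_per_s(x, fs, frame_len=80)
rt60_est = -60.0 / slope
print(f"estimated RT60: {rt60_est:.3f} s")
```

With no noise the fit recovers the true decay. Adding a stationary noise floor flattens the tail of the log-energy curve and biases a plain regression like this one, which is the sensitivity the abstract refers to.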


2 Papers