SD AI Mar 24

MSR-HuBERT: Self-supervised Pre-training for Adaptation to Multiple Sampling Rates

arXiv:2603.23048 38.0 h-index: 8
AI Analysis

The paper addresses a practical limitation in speech processing for applications whose audio arrives at varied sampling rates, though the contribution is incremental because it builds directly on HuBERT.

The paper addresses the difficulty that self-supervised speech models have with mixed-sampling-rate data. It proposes MSR-HuBERT, which replaces HuBERT's single-rate downsampling CNN with a multi-sampling-rate one that maps inputs at different rates to a shared temporal resolution without resampling. In experiments from 16 to 48 kHz, MSR-HuBERT outperforms HuBERT on speech recognition and full-band speech reconstruction while preserving high-frequency detail.

Self-supervised learning (SSL) has advanced speech processing. However, existing speech SSL methods typically assume a single sampling rate and struggle with mixed-rate data due to temporal resolution mismatch. To address this limitation, we propose MSR-HuBERT, a multi-sampling-rate adaptive pre-training method. Building on HuBERT, we replace its single-rate downsampling CNN with a multi-sampling-rate adaptive downsampling CNN that maps raw waveforms from different sampling rates to a shared temporal resolution without resampling. This design enables unified mixed-rate pre-training and fine-tuning. In experiments spanning 16 to 48 kHz, MSR-HuBERT outperforms HuBERT on speech recognition and full-band speech reconstruction, preserving high-frequency detail while modeling low-frequency semantic structure. Moreover, MSR-HuBERT retains HuBERT's mask-prediction objective and Transformer encoder, so existing analyses and improvements developed for HuBERT apply directly.
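
The "temporal resolution mismatch" can be made concrete with a little stride arithmetic. The sketch below starts from the standard HuBERT feature-extractor layout (seven 1-D convolutions with a total stride of 320 samples, i.e. 50 frames per second at 16 kHz). The rate-adapted variant is only an illustrative assumption, not the design described in the paper: the helper `rate_adapted_layers` is hypothetical and simply scales the first layer's kernel and stride by the integer ratio to 16 kHz, so that every supported rate lands on the same 50 Hz frame grid without resampling the waveform.

```python
from math import prod

# Standard HuBERT / wav2vec 2.0 feature extractor: (kernel, stride) per conv layer.
HUBERT_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]
BASE_RATE = 16_000


def rate_adapted_layers(sample_rate):
    """Hypothetical rate-specific front end (an assumption, not the paper's design).

    Scales the first layer's kernel and stride by sample_rate / BASE_RATE.
    Only integer multiples of 16 kHz are handled in this sketch.
    """
    if sample_rate % BASE_RATE != 0:
        raise ValueError("this sketch only handles integer multiples of 16 kHz")
    factor = sample_rate // BASE_RATE
    (k0, s0), rest = HUBERT_LAYERS[0], HUBERT_LAYERS[1:]
    return [(k0 * factor, s0 * factor)] + rest


def total_stride(layers):
    """Number of input samples consumed per output frame."""
    return prod(stride for _, stride in layers)


def num_frames(num_samples, layers):
    """Output length of a stack of unpadded 1-D convolutions."""
    n = num_samples
    for kernel, stride in layers:
        n = (n - kernel) // stride + 1
    return n


if __name__ == "__main__":
    for rate in (16_000, 32_000, 48_000):
        fixed = rate / total_stride(HUBERT_LAYERS)
        adapted_layers = rate_adapted_layers(rate)
        adapted = rate / total_stride(adapted_layers)
        frames = num_frames(rate, adapted_layers)  # one second of audio
        print(f"{rate} Hz: fixed CNN {fixed:.0f} frames/s, "
              f"adapted CNN {adapted:.0f} frames/s ({frames} frames for 1 s)")
```

With the fixed 16 kHz front end, a 48 kHz waveform would produce 150 frames per second instead of 50, which is the mismatch the abstract refers to. Under the assumed scaling, all three rates yield the same frame rate, so the Transformer encoder and the mask-prediction targets see a shared temporal resolution.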

