SD · CL · AS · May 7, 2025

Miipher-2: A Universal Speech Restoration Model for Million-Hour Scale Data Restoration

arXiv:2505.04457v45 citations · h-index: 51 · WASPAA
Originality: Incremental advance
AI Analysis

This work addresses the challenge of efficiently cleaning million-hour speech datasets used to train large-scale generative models. The contribution is an incremental improvement in data preparation for AI training.

The paper tackles the problem of cleaning large-scale speech data for training generative models by introducing Miipher-2, a speech restoration model. Across multiple languages, it achieves superior or comparable performance to conventional models on metrics such as word error rate and sound quality. Its real-time factor of 0.0078 allows a million-hour dataset to be processed in about three days using 100 accelerators.
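The three-day figure follows directly from the reported real-time factor. The sketch below reproduces the arithmetic, assuming the real-time factor applies per accelerator and that the workload scales perfectly linearly across accelerators; the function name `processing_days` is hypothetical and used only for illustration.

```python
def processing_days(dataset_hours, rtf, num_accelerators):
    """Estimate wall-clock days needed to process a speech dataset.

    Assumes (not stated in the source) that `rtf` is the per-accelerator
    real-time factor and that work divides evenly with no overhead.
    """
    # Real-time factor = processing time / audio duration,
    # so total compute time is audio duration times RTF.
    compute_hours = dataset_hours * rtf
    # Spread the compute evenly over all accelerators.
    wall_hours = compute_hours / num_accelerators
    return wall_hours / 24


days = processing_days(dataset_hours=1_000_000, rtf=0.0078, num_accelerators=100)
print(f"{days:.2f} days")
```

Under these assumptions, one million hours of audio needs roughly 7,800 accelerator-hours, or about 78 hours of wall-clock time on 100 accelerators, which is consistent with "approximately three days".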

Training data cleaning is a new application for generative model-based speech restoration (SR). This paper introduces Miipher-2, an SR model designed to clean million-hour-scale training data for large-scale generative models such as large language models. Key challenges addressed include generalization to unseen languages, operation without explicit conditioning (e.g., text, speaker ID), and computational efficiency. Miipher-2 utilizes a frozen, pre-trained Universal Speech Model (USM), supporting over 300 languages, as a robust, conditioning-free feature extractor. To optimize efficiency and minimize memory, Miipher-2 incorporates parallel adapters for predicting clean USM features from noisy inputs and employs the WaveFit neural vocoder for waveform synthesis. These components were trained on 3,000 hours of multi-lingual, studio-quality recordings with augmented degradations, while USM parameters remained fixed. Experimental results demonstrate Miipher-2's superior or comparable performance to conventional SR models in word-error-rate, speaker similarity, and both objective and subjective sound quality scores across all tested languages. Miipher-2 operates efficiently on consumer-grade accelerators, achieving a real-time factor of 0.0078, enabling the processing of a million-hour speech dataset in approximately three days using only 100 such accelerators.
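The idea of a parallel adapter attached to a frozen layer can be illustrated with a toy example. This is a generic sketch of the technique, not Miipher-2's actual architecture: the layer shapes, the ReLU nonlinearity, the zero initialization of the up-projection, and all names (`frozen_layer`, `adapter`, `layer_with_parallel_adapter`) are assumptions made for illustration.

```python
import random


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def frozen_layer(x, w):
    # Stand-in for one frozen, pre-trained layer; w is never updated.
    return [max(0.0, v) for v in matvec(w, x)]


def adapter(x, w_down, w_up):
    # Small trainable bottleneck: project down, apply ReLU, project up.
    hidden = [max(0.0, v) for v in matvec(w_down, x)]
    return matvec(w_up, hidden)


def layer_with_parallel_adapter(x, w, w_down, w_up):
    # The adapter reads the same input as the frozen layer (in parallel),
    # and its output is added to the frozen layer's output.
    base = frozen_layer(x, w)
    delta = adapter(x, w_down, w_up)
    return [b + d for b, d in zip(base, delta)]


rng = random.Random(0)
dim, bottleneck = 8, 2
w = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
w_down = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bottleneck)]
# Zero-initialized up-projection: before training, the adapter is a no-op.
w_up = [[0.0] * bottleneck for _ in range(dim)]

x = [rng.gauss(0, 1) for _ in range(dim)]
y = layer_with_parallel_adapter(x, w, w_down, w_up)
```

In this sketch only `w_down` and `w_up` would be trained, so the trainable parameter count grows with the bottleneck width rather than with the size of the frozen model.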

Code Implementations: 1 repo