AS · CL · LG · SD · Oct 25, 2023

Generative Pre-training for Speech with Flow Matching

arXiv:2310.16338v2 · 68 citations · h-index: 41
AI Analysis

This work proposes a foundational generative model for speech that could benefit various speech processing applications, though it appears incremental as it builds on existing generative techniques.

The authors tackled the lack of a general-purpose generative model for speech by pre-training SpeechFlow on 60k hours of untranscribed speech using Flow Matching, and showed it can be fine-tuned to match or surpass expert models on tasks like speech enhancement, separation, and synthesis.

Generative models have gained increasing attention in recent years for their remarkable success in tasks that require estimating and sampling from a data distribution to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoders are good examples where generative models have shined. While generative models have been applied to different applications in speech, there exists no general-purpose generative model that models speech directly. In this work, we take a step in this direction by showing that a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, we pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experimental results show that the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. Our work suggests that a foundational model for generation tasks in speech can be built with generative pre-training.
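To make "Flow Matching and masked conditions" concrete, here is a minimal sketch of how one training example could be built. It assumes the optimal-transport conditional path commonly used in the Flow Matching literature; the abstract does not state the path, the value of `sigma_min`, or the masking scheme, so those choices, along with the helper names `flow_matching_pair` and `mask_condition`, are illustrative assumptions rather than the authors' implementation.

```python
import random

SIGMA_MIN = 1e-5  # assumed small constant; not given in the abstract


def flow_matching_pair(x1, t, rng, sigma_min=SIGMA_MIN):
    """Build (x_t, target velocity) for one data frame x1 at time t in [0, 1].

    Assumed path: x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1,
    with x0 drawn from a standard normal distribution.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * n + t * d for n, d in zip(x0, x1)]
    # Velocity the model is trained to regress (time derivative of x_t).
    target = [d - (1.0 - sigma_min) * n for n, d in zip(x0, x1)]
    return x_t, target


def mask_condition(x1, mask_prob, rng, mask_value=0.0):
    """Hide each value with probability mask_prob (illustrative masking scheme)."""
    return [mask_value if rng.random() < mask_prob else v for v in x1]


def squared_error(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0, 0.25]  # toy stand-in for a frame of speech features
t = 0.3
x_t, target = flow_matching_pair(x1, t, rng)
cond = mask_condition(x1, mask_prob=0.7, rng=rng)

# A real model would predict the velocity from (x_t, cond, t);
# a zero predictor stands in here so the loss can be computed.
loss = squared_error([0.0] * len(x1), target)
```

Under this assumed path, following the target velocity from `x_t` for the remaining time `1 - t` lands on `x1` up to a `sigma_min`-sized noise term, which is why regressing that velocity yields a sampler. The masked copy `cond` is the only view of the clean signal the model would receive as conditioning.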

