AS CL SD | Mar 2, 2025

UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation

Alexander H. Liu, Sang-gil Lee, Chao-Han Huck Yang, Yuan Gong, Yu-Chiang Frank Wang, James R. Glass, Rafael Valle, Bryan Catanzaro

arXiv:2503.00733v1 | 5.96 citations | h-index: 29 | ICLR

Originality: Incremental advance

AI Analysis

This work addresses the overhead and cost of pre-training multiple foundation models for speech processing by offering a single general-purpose alternative. The advance is incremental, since it builds on existing pre-training techniques.

The authors address the need for separate foundation models for discriminative and generative speech tasks by proposing UniWav, a unified pre-training framework that jointly learns a representation encoder and a generative audio decoder. UniWav achieves performance comparable to task-specific models on speech recognition, text-to-speech, and speech tokenization.

Pre-training and representation learning have been playing an increasingly important role in modern speech processing. Nevertheless, different applications have been relying on different foundation models, since predominant pre-training techniques are either designed for discriminative tasks or generative tasks. In this work, we make the first attempt at building a unified pre-training framework for both types of tasks in speech. We show that with the appropriate design choices for pre-training, one can jointly learn a representation encoder and generative audio decoder that can be applied to both types of tasks. We propose UniWav, an encoder-decoder framework designed to unify pre-training representation learning and generative tasks. On speech recognition, text-to-speech, and speech tokenization, UniWav achieves comparable performance to different existing foundation models, each trained on a specific task. Our findings suggest that a single general-purpose foundation model for speech can be built to replace different foundation models, reducing the overhead and cost of pre-training.
