CL LG SD AS · Feb 3, 2023

Efficient Domain Adaptation for Speech Foundation Models

arXiv:2302.01496v1 · 31 citations · h-index: 69
Originality: Incremental advance
AI Analysis

This work addresses the problem of reducing the data and parameter requirements for adapting speech recognition systems to new domains. The advance is incremental, as it builds on existing foundation model techniques.

The paper tackles efficient domain adaptation for speech foundation models. It proposes joint finetuning with source and unsupervised target data, followed by adapter and decoder finetuning. The method achieves the same quality with only 21.6M supervised in-domain data and 130.8M finetuned parameters, compared to a baseline requiring 300M supervised data and 731.1M parameters.

Foundation models (FMs), which are trained on broad data at scale and are adaptable to a wide range of downstream tasks, have attracted great interest in the research community. Benefiting from diverse data sources such as different modalities, languages and application domains, foundation models have demonstrated strong generalization and knowledge transfer capabilities. In this paper, we present a pioneering study towards building an efficient solution for FM-based speech recognition systems. We adopt the recently developed self-supervised BEST-RQ for pretraining, and propose joint finetuning with both source and unsupervised target domain data using JUST Hydra. The FM encoder adapter and decoder are then finetuned to the target domain with a small amount of supervised in-domain data. On a large-scale YouTube and Voice Search task, our method is shown to be both data and model parameter efficient. It achieves the same quality with only 21.6M supervised in-domain data and 130.8M finetuned parameters, compared to the 731.1M model trained from scratch on an additional 300M supervised in-domain data.
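
To put the efficiency figures in perspective, the ratios can be worked out directly from the numbers quoted in the abstract. The sketch below is only illustrative arithmetic: the variable names are hypothetical, the data counts are taken in whatever unit the abstract uses, and treating 300M as the full baseline data size is an assumption, since the abstract describes it as "additional" data.

```python
# Figures quoted in the abstract, in millions.
finetuned_params_m = 130.8   # parameters finetuned by the proposed method
baseline_params_m = 731.1    # size of the model trained from scratch
adapted_data_m = 21.6        # supervised in-domain data used by the method
baseline_data_m = 300.0      # assumption: treated as the baseline data size


def fraction(part: float, whole: float) -> float:
    """Return part / whole as a plain ratio."""
    return part / whole


param_fraction = fraction(finetuned_params_m, baseline_params_m)
data_fraction = fraction(adapted_data_m, baseline_data_m)

print(f"Finetuned parameters vs. baseline model: {param_fraction:.1%}")
print(f"Supervised in-domain data vs. baseline:  {data_fraction:.1%}")
```

Under these assumptions, the method finetunes roughly 18% of the baseline's parameter count and uses roughly 7% of its supervised in-domain data.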

