CL LG SD AS | Mar 20, 2023

Cocktail HuBERT: Generalized Self-Supervised Pre-training for Mixture and Single-Source Speech

arXiv:2303.11131v12.916 citations | h-index: 41

Originality: Incremental advance

AI Analysis

This work addresses the limitation of existing self-supervised models that only handle single-source speech, enabling better performance in real-world scenarios with overlapping speakers, though it is an incremental improvement over prior frameworks.

The paper tackles self-supervised learning for mixture speech, which typically contains multiple overlapping speakers. It introduces Cocktail HuBERT, trained with a masked pseudo source separation objective, and reports 69% lower WER on multi-speaker ASR and 31% lower DER on diarization than state-of-the-art methods.

Self-supervised learning leverages unlabeled data effectively, improving label efficiency and generalization to domains without labeled data. While recent work has studied generalization to more acoustic/linguistic domains, languages, and modalities, these investigations are limited to single-source speech with one primary speaker in the recording. This paper presents Cocktail HuBERT, a self-supervised learning framework that generalizes to mixture speech using a masked pseudo source separation objective. This objective encourages the model to identify the number of sources, separate and understand the context, and infer the content of masked regions represented as discovered units. Cocktail HuBERT outperforms state-of-the-art results with 69% lower WER on multi-speaker ASR and 31% lower DER on diarization, and is competitive on single- and multi-speaker tasks from SUPERB.
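To make the masked pseudo source separation objective more concrete, here is a minimal sketch of one way such a loss could be computed. It is not taken from the paper: the function names, the use of one prediction stream per source, and the permutation-invariant matching of streams to sources are all assumptions made for illustration. Each stream emits per-frame log-probabilities over discovered units, and the loss is evaluated only on masked frames.

```python
import itertools
import math


def masked_cross_entropy(log_probs, targets, mask):
    """Mean negative log-likelihood of `targets`, over masked frames only.

    log_probs[t][v] is the log-probability of unit v at frame t.
    """
    masked_frames = [t for t, is_masked in enumerate(mask) if is_masked]
    if not masked_frames:
        return 0.0
    total = -sum(log_probs[t][targets[t]] for t in masked_frames)
    return total / len(masked_frames)


def pseudo_separation_loss(stream_log_probs, source_units, mask):
    """Hypothetical loss: best stream-to-source assignment of masked losses.

    Assumption (not stated in the abstract): the assignment between
    prediction streams and sources is chosen to minimize the total loss.
    """
    num_streams = len(stream_log_probs)
    best = math.inf
    for perm in itertools.permutations(range(num_streams)):
        total = sum(
            masked_cross_entropy(stream_log_probs[k], source_units[perm[k]], mask)
            for k in range(num_streams)
        )
        best = min(best, total)
    return best


# Toy example: 2 streams, 3 frames, 2 units; frames 0 and 2 are masked.
prefers_unit_1 = [[math.log(0.1), math.log(0.9)] for _ in range(3)]
prefers_unit_0 = [[math.log(0.9), math.log(0.1)] for _ in range(3)]
loss = pseudo_separation_loss(
    [prefers_unit_1, prefers_unit_0],   # stream predictions
    [[0, 0, 0], [1, 1, 1]],             # unit sequence of each source
    [True, False, True],                # mask over frames
)
```

In this toy case the lower loss comes from matching the first stream to the second source, so the stream order does not need to agree with the source order. A real system would compute this over batches with a tensor library and a learned encoder; the sketch only shows the shape of the objective.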
