CL, May 17, 2023

DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning

arXiv:2305.10005v2, 43 citations
Originality: Incremental advance
AI Analysis

This work addresses speech representation learning for applications in AI and speech processing. It offers a novel integration of methods, but the advance is incremental because it builds on existing self-supervised techniques.

The paper tackles self-supervised speech representation learning by introducing DinoSR, which combines masked language modeling, self-distillation, and online clustering. The resulting model surpasses previous state-of-the-art performance on several downstream tasks.

In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR) which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network. We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and provide a detailed analysis of the model and the learned discrete units.
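The three steps in the abstract (teacher embeddings, online clustering, discrete targets for the student) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The nearest-neighbour assignment, the exponential-moving-average (EMA) codebook and teacher updates, the decay values, and every function name below are assumptions made for illustration; weight arrays stand in for the networks.

```python
import numpy as np

def assign_codes(embeddings, codebook):
    """Assumed clustering step: index of the nearest codeword per frame."""
    # (T, 1, D) - (1, K, D) -> (T, K) squared Euclidean distances
    d2 = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def ema_codebook_update(sums, counts, embeddings, codes, decay=0.9):
    """Assumed online update: EMA of per-code embedding sums and counts."""
    num_codes = sums.shape[0]
    one_hot = np.eye(num_codes)[codes]        # (T, K)
    batch_sums = one_hot.T @ embeddings       # (K, D)
    batch_counts = one_hot.sum(axis=0)        # (K,)
    sums = decay * sums + (1.0 - decay) * batch_sums
    counts = decay * counts + (1.0 - decay) * batch_counts
    codebook = sums / counts[:, None]         # codeword = running mean
    return sums, counts, codebook

def cross_entropy(logits, targets):
    """Student loss: predict the teacher's discrete code for each frame."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def ema_teacher_update(teacher_w, student_w, decay=0.999):
    """Assumed self-distillation step: teacher tracks the student by EMA."""
    return decay * teacher_w + (1.0 - decay) * student_w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames, dim, num_codes = 50, 8, 4          # toy sizes, hypothetical
    teacher_emb = rng.normal(size=(frames, dim))      # stand-in teacher output
    codebook = rng.normal(size=(num_codes, dim))
    counts = np.ones(num_codes)
    sums = codebook * counts[:, None]

    codes = assign_codes(teacher_emb, codebook)       # discrete targets
    sums, counts, codebook = ema_codebook_update(sums, counts, teacher_emb, codes)
    student_logits = rng.normal(size=(frames, num_codes))  # stand-in student output
    print("student loss:", cross_entropy(student_logits, codes))
```

In this sketch the codebook receives no gradient: it changes only through the running averages, so codes with no assigned frames in a batch keep their current codeword.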

Code Implementations: 1 repo