CL, SD, AS | Aug 11, 2025

Iterative refinement, not training objective, makes HuBERT behave differently from wav2vec 2.0

arXiv:2508.08110v1 | 4 citations | h-index: 3 | INTERSPEECH
Originality: Synthesis-oriented
AI Analysis

This paper addresses an under-studied question for speech researchers: how model architecture affects what self-supervised speech representations learn. Its contribution is incremental, since it compares existing models rather than proposing a new one.

The study investigated how architectural differences affect the linguistic information in self-supervised speech models. It found that iterative pseudo-label refinement, not training objective, explains differences in how strongly hidden representations correlate with word, phoneme, and speaker identity.

Self-supervised models for speech representation learning now see widespread use for their versatility and performance on downstream tasks, but the effect of model architecture on the linguistic information learned in their representations remains under-studied. This study investigates two such models, HuBERT and wav2vec 2.0, and minimally compares two of their architectural differences: training objective and iterative pseudo-label refinement through multiple training iterations. We find that differences in canonical correlation of hidden representations to word identity, phoneme identity, and speaker identity are explained by training iteration, not training objective. We suggest that future work investigate the reason for the effectiveness of iterative refinement in encoding linguistic information in self-supervised speech representations.
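The canonical correlation measurement mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code, and the exact CCA variant used in the paper is not stated here; the sketch computes a plain mean canonical correlation between a matrix of hidden representations and one-hot identity labels. The function names (`mean_canonical_correlation`, `one_hot`) and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Turn integer labels into a (num_samples, num_classes) indicator matrix."""
    return np.eye(num_classes)[labels]


def _orthonormal_basis(A, rel_tol=1e-10):
    """Orthonormal basis for the column space of A, dropping near-null directions."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, s > rel_tol * s.max()]


def mean_canonical_correlation(X, Y):
    """Mean canonical correlation between two views whose rows are aligned.

    X: (num_samples, dim_x), e.g. hidden representations (hypothetical input).
    Y: (num_samples, dim_y), e.g. one-hot word, phoneme, or speaker labels.
    """
    # Center both views so the correlations are computed about the mean.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Qx = _orthonormal_basis(X)
    Qy = _orthonormal_basis(Y)
    # Singular values of Qx^T Qy are the canonical correlations.
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(np.clip(rho, 0.0, 1.0).mean())


# Synthetic stand-in for real model outputs (assumption: 500 frames, 16 dims, 5 classes).
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=500)
class_means = rng.normal(size=(5, 16))

# Representations that encode the label versus representations that ignore it.
reps_informative = class_means[labels] + 0.1 * rng.normal(size=(500, 16))
reps_random = rng.normal(size=(500, 16))

Y = one_hot(labels, 5)
score_informative = mean_canonical_correlation(reps_informative, Y)
score_random = mean_canonical_correlation(reps_random, Y)
print(score_informative, score_random)
```

In a comparison like the one the abstract describes, a score of this kind would be computed per layer and per model, once for each label type, so that models differing only in training objective or training iteration can be set side by side.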

