CLAS, Jun 12, 2025

Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models

arXiv:2506.10855v1, 3 citations, h-index: 45
Originality: Synthesis-oriented
AI Analysis

This paper offers insight into how self-supervised speech models generalize across languages, though the contribution is incremental: it extends existing analysis methods to non-English data.

The study investigated how wav2vec2 models trained on different languages encode phonetic, tonal, and speaker information, finding that the subspaces encoding these three kinds of information are largely orthogonal and that the structure of the representations is largely independent of the pretraining language.

Analyses of self-supervised speech models have begun to reveal where and how they represent different types of information. However, almost all analyses have focused on English. Here, we examine how wav2vec2 models trained on four different languages encode both language-matched and non-matched speech. We use probing classifiers and geometric analyses to examine how phones, lexical tones, and speaker information are represented. We show that for all pretraining and test languages, the subspaces encoding phones, tones, and speakers are largely orthogonal, and that layerwise patterns of probing accuracy are similar, with a relatively small advantage for matched-language phone and tone (but not speaker) probes in the later layers. Our findings suggest that the structure of representations learned by wav2vec2 is largely independent of the speech material used during pretraining.
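To make the idea of "largely orthogonal subspaces" concrete, here is a minimal sketch of one way such a geometric comparison could be set up. It is not the authors' procedure: the subspace estimate (SVD of centered class means), the overlap score (mean squared cosine of the principal angles), the function names, and the synthetic features standing in for wav2vec2 frame representations are all assumptions made for illustration.

```python
import numpy as np


def class_mean_subspace(features, labels, k):
    """Return a (d, k) orthonormal basis for the top-k directions
    along which the class means vary (illustrative estimate only)."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    means -= means.mean(axis=0, keepdims=True)  # center the class means
    # Rows of vt are orthonormal directions in feature space.
    _, _, vt = np.linalg.svd(means, full_matrices=False)
    return vt[:k].T


def subspace_overlap(basis_a, basis_b):
    """Mean squared cosine of the principal angles between two subspaces:
    0 means orthogonal, 1 means identical."""
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return float(np.mean(cosines ** 2))


# Synthetic stand-in for frame-level features (hypothetical data):
# phone identity varies only in dims 0-3, speaker identity only in dims 4-7.
rng = np.random.default_rng(0)
n, d, k = 2000, 32, 2
idx = np.arange(n)
phones = idx % 5            # 5 phone classes
speakers = (idx // 5) % 5   # 5 speakers, fully crossed with phones

phone_dirs = np.zeros((5, d))
phone_dirs[:, 0:4] = rng.normal(size=(5, 4))
speaker_dirs = np.zeros((5, d))
speaker_dirs[:, 4:8] = rng.normal(size=(5, 4))
features = phone_dirs[phones] + speaker_dirs[speakers] + 0.1 * rng.normal(size=(n, d))

phone_basis = class_mean_subspace(features, phones, k)
speaker_basis = class_mean_subspace(features, speakers, k)

overlap = subspace_overlap(phone_basis, speaker_basis)     # close to 0 by construction
self_overlap = subspace_overlap(phone_basis, phone_basis)  # 1 up to rounding
print(f"phone vs speaker overlap: {overlap:.4f}")
print(f"phone vs phone overlap:   {self_overlap:.4f}")
```

In a real analysis, `features` would be replaced by hidden states extracted from a given layer, and the same comparison would be repeated per layer and per pretraining language.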

