AS · CL · LG · SD · Feb 21

[b]=[d]-[t]+[p]: Self-supervised Speech Models Discover Phonological Vector Arithmetic

arXiv:2602.18899v1 · 3 citations · Has Code
Originality: Incremental advance
AI Analysis

The paper clarifies how self-supervised speech models organize phonetic information, which could inform both speech technology and linguistic analysis. As interpretability work, it is an incremental advance.

The study analyzed self-supervised speech models across 96 languages and found that they encode phonological features as linear directions in representation space, enabling operations like vector arithmetic to transform sounds, such as deriving [b] from [p] using a voicing vector.

Self-supervised speech models (S3Ms) are known to encode rich phonetic information, yet how this information is structured remains underexplored. We conduct a comprehensive study across 96 languages to analyze the underlying structure of S3M representations, with particular attention to phonological vectors. We first show that there exist linear directions within the model's representation space that correspond to phonological features. We further demonstrate that the scale of these phonological vectors correlates with the degree of acoustic realization of their corresponding phonological features in a continuous manner. For example, the difference between [d] and [t] yields a voicing vector: adding this vector to [p] produces [b], while scaling it results in a continuum of voicing. Together, these findings indicate that S3Ms encode speech using phonologically interpretable and compositional vectors, demonstrating phonological vector arithmetic. All code and interactive demos are available at https://github.com/juice500ml/phonetic-arithmetic .
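The arithmetic in the title, [b] = [d] - [t] + [p], can be illustrated with a minimal sketch. This is not the authors' code and uses no real S3M features: the phone centroids below are synthetic vectors built so that voicing is a single linear direction, which is the property the abstract reports. The names `toy_phone`, `centroids`, and `nearest_phone` are hypothetical, and cosine similarity is an assumed choice of similarity measure.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# ASSUMPTION: synthetic feature directions, standing in for directions
# that would be estimated from real S3M frame representations.
voicing = rng.normal(size=dim)
labial = rng.normal(size=dim)
alveolar = rng.normal(size=dim)


def toy_phone(place, voiced):
    """Build a toy phone centroid as place + optional voicing + small noise."""
    base = place + (voicing if voiced else 0.0)
    return base + 0.01 * rng.normal(size=dim)


# Hypothetical per-phone centroids (e.g. mean representation per phone).
centroids = {
    "p": toy_phone(labial, voiced=False),
    "b": toy_phone(labial, voiced=True),
    "t": toy_phone(alveolar, voiced=False),
    "d": toy_phone(alveolar, voiced=True),
}


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def nearest_phone(vec):
    """Return the phone whose centroid is most similar to vec."""
    return max(centroids, key=lambda phone: cosine(vec, centroids[phone]))


# Voicing vector from the alveolar pair, applied to the labial stop.
voicing_vector = centroids["d"] - centroids["t"]
result = nearest_phone(centroids["p"] + voicing_vector)

# Scaling the vector moves [p] gradually along the voicing direction.
unit = voicing_vector / np.linalg.norm(voicing_vector)
scales = [0.0, 0.25, 0.5, 0.75, 1.0]
projections = [float((centroids["p"] + s * voicing_vector) @ unit) for s in scales]

print("[p] + ([d] - [t]) is nearest to:", result)
print("projection onto voicing direction:", projections)
```

With real models, the centroids would instead come from averaging frame-level representations per phone, and whether the analogy holds is the empirical question the paper studies across 96 languages.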

