CL · Jun 13, 2024

Orthogonality and isotropy of speaker and phonetic information in self-supervised speech representations

arXiv:2406.09200v1 · 12 citations
Originality: Synthesis-oriented
AI Analysis

This work helps speech-technology researchers understand which properties make speech representations useful, but it is incremental because it builds on existing hypotheses.

The study investigates whether orthogonality and isotropy in self-supervised speech representations correlate with downstream task performance. Both measures correlate with phonetic probing accuracy, though the isotropy results are more nuanced.

Self-supervised speech representations can hugely benefit downstream speech technologies, yet the properties that make them useful are still poorly understood. Two candidate properties related to the geometry of the representation space have been hypothesized to correlate well with downstream tasks: (1) the degree of orthogonality between the subspaces spanned by the speaker centroids and phone centroids, and (2) the isotropy of the space, i.e., the degree to which all dimensions are effectively utilized. To study them, we introduce a new measure, Cumulative Residual Variance (CRV), which can be used to assess both properties. Using linear classifiers for speaker and phone ID to probe the representations of six different self-supervised models and two untrained baselines, we ask whether either orthogonality or isotropy correlates with linear probing accuracy. We find that both measures correlate with phonetic probing accuracy, though our results on isotropy are more nuanced.
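The two geometric properties named in the abstract can be illustrated with a small NumPy sketch. This is not the paper's definition of CRV, which the abstract does not spell out. It is an assumed stand-in: it tracks how much of one centroid set's variance survives as the principal directions of the other set are removed, and it uses a participation ratio of covariance eigenvalues as an isotropy proxy. All function names (`class_centroids`, `principal_directions`, `residual_variance_curve`, `isotropy_proxy`) are hypothetical.

```python
import numpy as np


def class_centroids(X, labels):
    """Per-class mean vectors, centered on the mean of the centroids."""
    classes = np.unique(labels)
    cents = np.stack([X[labels == c].mean(axis=0) for c in classes])
    return cents - cents.mean(axis=0, keepdims=True)


def principal_directions(M, tol=1e-10):
    """Orthonormal directions (rows) spanning the rows of a centered matrix M."""
    _, s, vt = np.linalg.svd(M, full_matrices=False)
    # Drop directions with (numerically) zero variance; they are arbitrary.
    return vt[s > tol * s.max()]


def residual_variance_curve(A, B):
    """Fraction of B's variance left after removing A's top-k directions.

    Entry k corresponds to removing the first k directions of A (k = 0..rank).
    Assumption: this is an illustrative proxy, not the paper's CRV.
    """
    dirs = principal_directions(A)
    total = np.sum(B ** 2)
    removed = np.cumsum(np.sum((B @ dirs.T) ** 2, axis=0))
    return np.concatenate(([1.0], 1.0 - removed / total))


def isotropy_proxy(X):
    """Participation ratio of covariance eigenvalues, scaled to (0, 1].

    A value of 1 means variance is spread evenly over all dimensions.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    eig = np.clip(np.linalg.eigvalsh(Xc.T @ Xc / len(Xc)), 0.0, None)
    return float(eig.sum() ** 2 / (np.sum(eig ** 2) * len(eig)))
```

Under these assumptions, a curve that stays near 1 when speaker-centroid directions are removed from the phone centroids would suggest the two subspaces are close to orthogonal, while a curve that falls quickly toward 0 would suggest they overlap.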

