Self-Supervised Speech Models Encode Phonetic Context via Position-dependent Orthogonal Subspaces
This work offers researchers insight into the interpretability of speech models, though its contribution is incremental because it builds on prior findings that phones are represented compositionally.
The study investigates how self-supervised speech models encode phonetic context in a single frame-level representation. It finds that phonological information from neighboring phones is compositionally encoded, with properties such as orthogonality between relative positions and implicit phonetic boundaries.
Transformer-based self-supervised speech models (S3Ms) are often described as contextualized, yet what this entails remains unclear. Here, we focus on how a single frame-level S3M representation can encode phones and their surrounding context. Prior work has shown that S3Ms represent phones compositionally; for example, phonological vectors for voicing, bilabiality, and nasality are superposed in the S3M representation of [m]. We extend this view by proposing that phonological information from a sequence of neighboring phones is also compositionally encoded in a single frame, such that vectors corresponding to the previous, current, and next phones are superposed within a single frame-level representation. We show that this structure has several properties, including orthogonality between relative positions and the emergence of implicit phonetic boundaries. Together, our findings advance our understanding of context-dependent S3M representations.
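The superposition claim can be illustrated with a toy sketch on synthetic data. This is not the analysis used in the study and involves no real S3M. It assumes, purely for illustration, that each relative position (previous, current, next) owns its own mutually orthogonal subspace, and that a frame is the sum of one phone vector per position. All names and sizes (`DIM`, `SUB`, `N_PHONES`, `frame`, `decode`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 64       # hypothetical frame dimensionality (toy value)
SUB = 8        # hypothetical subspace size per relative position
N_PHONES = 5   # hypothetical toy phone inventory
POSITIONS = ("prev", "curr", "next")

# Take one orthonormal basis and give each relative position a disjoint
# block of columns, so the three subspaces are mutually orthogonal.
q, _ = np.linalg.qr(rng.standard_normal((DIM, DIM)))
bases = {pos: q[:, i * SUB:(i + 1) * SUB] for i, pos in enumerate(POSITIONS)}

# Random per-position phone codes, expressed in subspace coordinates.
codes = {pos: rng.standard_normal((N_PHONES, SUB)) for pos in POSITIONS}


def frame(prev, curr, nxt):
    """Superpose one phone vector per relative position into a single frame."""
    phones = dict(zip(POSITIONS, (prev, curr, nxt)))
    return sum(bases[pos] @ codes[pos][phones[pos]] for pos in POSITIONS)


def decode(x, pos):
    """Project a frame onto one position's subspace and pick the nearest phone code."""
    coords = bases[pos].T @ x
    return int(np.argmin(np.linalg.norm(codes[pos] - coords, axis=1)))


x = frame(prev=2, curr=0, nxt=4)
decoded = tuple(decode(x, pos) for pos in POSITIONS)

# Largest absolute inner product between "prev" and "next" basis vectors;
# close to zero when the subspaces are orthogonal.
cross = float(np.abs(bases["prev"].T @ bases["next"]).max())
```

Because the subspaces are orthogonal by construction, projecting the single frame `x` onto each one recovers the phone at that relative position without interference from the other two. In a real model this structure would have to be estimated from data rather than assumed.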