James Kirby

h-index44

3papers

8,476citations

3 Papers

15.3CLJun 16

Perceptual compensation for tonal context in self-supervised speech models

James Kirby, Ioana Krehan, Michele Gubian

This study examines the extent to which the wav2vec2.0 architecture exhibits evidence of compensation for phonological context. We conducted a pseudo-replication of a perceptional compensation experiment on Mandarin Chinese tones, and compared the embedding similarities and probing classifier outputs between a purely self-supervised pre-trained model and a model fine-tuned for Mandarin ASR. No evidence of compensation was found in the embedding similarities of the purely pre-trained model. Probing classifiers showed some evidence of compensation in addition to the expected layer-wise improvements in categorization, but failed to replicate human performance on isolated test syllables. Our findings contrast with previous reports of sensitivity to phonological structure emerging through pre-training alone, and suggest that supervised objectives may be necessary to encourage the abstraction of at least some types of phonological regularities.

9.6CLJun 12, 2025

Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models

Michele Gubian, Ioana Krehan, Oli Liu et al.

Analyses of self-supervised speech models have begun to reveal where and how they represent different types of information. However, almost all analyses have focused on English. Here, we examine how wav2vec2 models trained on four different languages encode both language-matched and non-matched speech. We use probing classifiers and geometric analyses to examine how phones, lexical tones, and speaker information are represented. We show that for all pretraining and test languages, the subspaces encoding phones, tones, and speakers are largely orthogonal, and that layerwise patterns of probing accuracy are similar, with a relatively small advantage for matched-language phone and tone (but not speaker) probes in the later layers. Our findings suggest that the structure of representations learned by wav2vec2 is largely independent of the speech material used during pretraining.

7.6CLJul 16, 2015

Bias and population structure in the actuation of sound change

James Kirby, Morgan Sonderegger

Why do human languages change at some times, and not others? We address this longstanding question from a computational perspective, focusing on the case of sound change. Sound change arises from the pronunciation variability ubiquitous in every speech community, but most such variability does not lead to change. Hence, an adequate model must allow for stability as well as change. Existing theories of sound change tend to emphasize factors at the level of individual learners promoting one outcome or the other, such as channel bias (which favors change) or inductive bias (which favors stability). Here, we consider how the interaction of these biases can lead to both stability and change in a population setting. We find that population structure itself can act as a source of stability, but that both stability and change are possible only when both types of bias are active, suggesting that it is possible to understand why sound change occurs at some times and not others as the population-level result of the interplay between forces promoting each outcome in individual speakers. In addition, if it is assumed that learners learn from two or more teachers, the transition from stability to change is marked by a phase transition, consistent with the abrupt transitions seen in many empirical cases of sound change. The predictions of multiple-teacher models thus match empirical cases of sound change better than the predictions of single-teacher models, underscoring the importance of modeling language change in a population setting.