CLAS, Mar 25, 2024

Encoding of lexical tone in self-supervised models of spoken language

arXiv:2403.16865v2, 37 citations, h-index: 25, NAACL
AI Analysis

This paper addresses how AI models encode suprasegmental phonology. The contribution is incremental: it extends prior work on segmental features to tone.

The paper analyzes how self-supervised spoken language models encode lexical tone, using Mandarin and Vietnamese as case studies. It finds that these models encode tone to a significant degree even when trained on non-tonal languages, and that they behave similarly to native and non-native human participants in perception studies.

Interpretability research has shown that self-supervised Spoken Language Models (SLMs) encode a wide variety of features in human speech from the acoustic, phonetic, phonological, syntactic and semantic levels, to speaker characteristics. The bulk of prior research on representations of phonology has focused on segmental features such as phonemes; the encoding of suprasegmental phonology (such as tone and stress patterns) in SLMs is not yet well understood. Tone is a suprasegmental feature that is present in more than half of the world's languages. This paper aims to analyze the tone encoding capabilities of SLMs, using Mandarin and Vietnamese as case studies. We show that SLMs encode lexical tone to a significant degree even when they are trained on data from non-tonal languages. We further find that SLMs behave similarly to native and non-native human participants in tone and consonant perception studies, but they do not follow the same developmental trajectory.
