CL | AI | AS | Jun 3, 2025

Prosodic Structure Beyond Lexical Content: A Study of Self-Supervised Learning

arXiv:2506.02584v1 | 4 citations | h-index: 4 | INTERSPEECH
Originality: Incremental advance
AI Analysis

This work examines how prosodic structure can be understood in speech processing. Its originality is rated as incremental because it builds on existing self-supervised learning methods.

The study uses self-supervised learning to investigate how prosody contributes to speech structure independently of lexical content. It finds that representations from the proposed Masked Prosody Model yield strong relative gains over untransformed pitch, energy, and voice activity features, especially for labels that depend on longer-term structure, such as emotion.

People exploit the predictability of lexical structures during text comprehension. Though predictable structure is also present in speech, the degree to which prosody, e.g. intonation, tempo, and loudness, contributes to such structure independently of the lexical content is unclear. This study leverages self-supervised learning (SSL) to examine the temporal granularity of structures in the acoustic correlates of prosody. Representations from our proposed Masked Prosody Model can predict perceptual labels dependent on local information, such as word boundaries, but provide the most value for labels involving longer-term structures, like emotion recognition. Probing experiments across various perceptual labels show strong relative gains over untransformed pitch, energy, and voice activity features. Our results reveal the importance of SSL training objective timescale and highlight the value of complex SSL-encoded structures compared to more constrained classical structures.
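As a rough illustration of what a masked-prediction objective over prosodic features with a controllable timescale could look like, the sketch below hides contiguous spans of frames in a pitch/energy/voice-activity array. It is not the authors' implementation: the function name `mask_spans`, the parameters `span_len` and `mask_ratio`, the zero-fill masking, and the random stand-in features are all assumptions made for illustration. A model trained on such input would be asked to reconstruct the hidden frames, and `span_len` is one plausible knob for the timescale of the objective.

```python
import numpy as np


def mask_spans(features, span_len, mask_ratio, rng):
    """Zero out contiguous spans of frames in a (T, D) prosody feature array.

    Returns (masked_features, mask), where mask[t] is True for hidden frames.
    Hypothetical helper for illustration only, not taken from the paper.
    """
    num_frames = features.shape[0]
    mask = np.zeros(num_frames, dtype=bool)
    # Number of spans needed to cover roughly mask_ratio of the frames.
    # Spans may overlap, so the actual coverage can be lower.
    num_spans = max(1, int(round(mask_ratio * num_frames / span_len)))
    starts = rng.integers(0, max(1, num_frames - span_len + 1), size=num_spans)
    for start in starts:
        mask[start:start + span_len] = True
    masked = features.copy()
    masked[mask] = 0.0  # assumed masking scheme: replace hidden frames with zeros
    return masked, mask


rng = np.random.default_rng(0)
# Stand-in frame-level features; columns play the role of pitch, energy,
# and voice activity. Values are strictly positive so zeros mark masked frames.
features = rng.random((200, 3)) + 1.0
masked, mask = mask_spans(features, span_len=20, mask_ratio=0.3, rng=rng)
```

Under these assumptions, a short `span_len` can often be filled in from neighbouring frames alone, while a long `span_len` forces a model to rely on wider context, which is one way to read the abstract's point about objective timescale.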

