LG · May 29, 2025

Towards disentangling the contributions of articulation and acoustics in multimodal phoneme recognition

arXiv:2505.24059v1 · 5 citations · h-index: 22
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of interpreting multimodal speech recognition models and is aimed at researchers in speech processing. It is incremental: it builds on prior studies, differing mainly in its focus on a single-speaker dataset.

The study tackled the problem of disentangling acoustic and articulatory contributions in phoneme recognition by using a single-speaker MRI corpus to reduce cross-speaker variability, finding that audio and multimodal models performed similarly on phonetic manner classes but diverged on places of articulation.

Although many previous studies have carried out multimodal learning with real-time MRI data that captures the audio-visual kinematics of the vocal tract during speech, these studies have been limited by their reliance on multi-speaker corpora. This prevents such models from learning a detailed relationship between acoustics and articulation due to considerable cross-speaker variability. In this study, we develop unimodal audio and video models as well as multimodal models for phoneme recognition using a long-form single-speaker MRI corpus, with the goal of disentangling and interpreting the contributions of each modality. Audio and multimodal models show similar performance on different phonetic manner classes but diverge on places of articulation. Interpretation of the models' latent space shows similar encoding of the phonetic space across audio and multimodal models, while the models' attention weights highlight differences in acoustic and articulatory timing for certain phonemes.

