ARTI-6: Towards Six-dimensional Articulatory Speech Encoding
This work addresses the need for interpretable, efficient speech technology in articulatory inversion and synthesis. The contribution appears incremental: it builds on existing methods and adds a new low-dimensional representation.
The paper tackles the problem of representing speech with a compact, physiologically grounded encoding. It proposes ARTI-6, a six-dimensional articulatory framework derived from real-time MRI data. The framework achieves a prediction correlation of 0.87 for articulatory inversion and generates natural-sounding speech from the six articulatory features.
We propose ARTI-6, a compact six-dimensional articulatory speech encoding framework derived from real-time MRI data that captures crucial vocal tract regions including the velum, tongue root, and larynx. ARTI-6 consists of three components: (1) a six-dimensional articulatory feature set representing key regions of the vocal tract; (2) an articulatory inversion model, which uses speech foundation models to predict articulatory features from speech acoustics and achieves a prediction correlation of 0.87; and (3) an articulatory synthesis model, which reconstructs intelligible speech directly from articulatory features, showing that even a low-dimensional representation can generate natural-sounding speech. Together, these components make ARTI-6 an interpretable, computationally efficient, and physiologically grounded framework for advancing articulatory inversion, synthesis, and broader speech technology applications. The source code and speech samples are publicly available.
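
To make the "prediction correlation" figure concrete, the following is a minimal sketch of one common way such a score is computed for articulatory inversion: a Pearson correlation between reference and predicted trajectories for each articulatory dimension, averaged over the six dimensions. The abstract does not state the exact metric or averaging scheme, so both are assumptions here. The function name `per_dimension_correlation`, the array shapes, and the synthetic data are hypothetical and serve only as illustration; they are not taken from the ARTI-6 source code.

```python
import numpy as np


def per_dimension_correlation(reference, predicted):
    """Pearson correlation for each articulatory dimension.

    Both inputs are arrays of shape (T, D): T time frames, D articulatory
    dimensions (D = 6 for a six-dimensional encoding). Returns an array of
    shape (D,). A dimension that is constant over time yields NaN.
    """
    reference = np.asarray(reference, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if reference.ndim != 2 or reference.shape != predicted.shape:
        raise ValueError("expected two arrays of identical shape (T, D)")

    # Center each dimension over time.
    ref_c = reference - reference.mean(axis=0)
    pred_c = predicted - predicted.mean(axis=0)

    # Covariance over the product of standard deviations, per dimension.
    numerator = (ref_c * pred_c).sum(axis=0)
    denominator = np.sqrt((ref_c ** 2).sum(axis=0) * (pred_c ** 2).sum(axis=0))
    return numerator / denominator


# Synthetic stand-in data (assumption): random "reference" trajectories and a
# noisy copy playing the role of an inversion model's output.
rng = np.random.default_rng(0)
num_frames, num_dims = 500, 6
reference = rng.standard_normal((num_frames, num_dims))
predicted = reference + 0.5 * rng.standard_normal((num_frames, num_dims))

corr = per_dimension_correlation(reference, predicted)
mean_corr = float(corr.mean())
print("per-dimension correlation:", np.round(corr, 3))
print("mean correlation:", round(mean_corr, 3))
```

Reporting the per-dimension values alongside the mean is often useful, because a single averaged score can hide one articulatory region that is predicted much worse than the others.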