Speech2rtMRI: Speech-Guided Diffusion Model for Real-time MRI Video of the Vocal Tract during Speech
This work addresses the need to visualize articulator motion for applications such as second language learning and animation. Its contribution is incremental, however, since it builds on existing diffusion models and pre-trained speech embeddings.
The authors address the problem of generating real-time MRI videos of the vocal tract from speech input. They use a speech-guided diffusion model that draws on pre-trained speech representations to improve visual generation. Limitations include non-smooth tongue motion and video distortion when the tongue contacts the palate.
Understanding speech production both visually and kinematically can inform the design of second language learning systems, as well as the creation of speaking characters in video games and animations. In this work, we introduce a data-driven method that visually represents articulator motion in Magnetic Resonance Imaging (MRI) videos of the human vocal tract during speech, based on arbitrary audio or speech input. We leverage large pre-trained speech models, which encode prior knowledge, so that a speech-to-video diffusion model can generalize its visual output to unseen data. Our findings demonstrate that visual generation benefits significantly from the pre-trained speech representations. We also observed that evaluating phonemes in isolation is challenging, but evaluation becomes more straightforward when phonemes are assessed within the context of spoken words. Limitations of the current results include non-smooth tongue motion and video distortion when the tongue contacts the palate.