Masked Autoencoders Are Articulatory Learners
This work addresses a specific data quality issue in speech production research and technology: it makes previously unusable recordings available for applications such as articulatory-based speech synthesis. The contribution is incremental, however, because it applies an existing method to a new domain.
The paper addresses mistracked articulatory recordings in the XRMB dataset by reconstructing them with Masked Autoencoders, achieving accurate reconstruction for 41 of 47 speakers and recovering 3.28 of 3.4 hours of previously unusable data.
Articulatory recordings track the positions and motion of different articulators along the vocal tract and are widely used to study speech production and to develop speech technologies such as articulatory-based speech synthesizers and speech inversion systems. The University of Wisconsin X-Ray Microbeam (XRMB) dataset is one of several datasets that provide articulatory recordings synchronized with audio recordings. The XRMB articulatory recordings rely on pellets placed on a number of articulators, which are tracked by the microbeam. However, a significant portion of the articulatory recordings are mistracked and have so far been unusable. In this work, we present a deep-learning-based approach using Masked Autoencoders to accurately reconstruct the mistracked articulatory recordings for 41 out of 47 speakers of the XRMB dataset. Our model is able to reconstruct articulatory trajectories that closely match ground truth, even when three out of eight articulators are mistracked, and retrieve 3.28 out of 3.4 hours of previously unusable recordings.
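To make the masking idea concrete, the following is a minimal sketch of how a training example for a masked autoencoder could be prepared and scored. It is not the authors' implementation. The array layout (frames x articulators x 2 coordinates), the zero fill value, the choice to mask whole articulators for the full utterance, and the function names `mask_articulators` and `masked_mse` are all assumptions made for illustration; only the figure of three masked articulators out of eight comes from the text above. The model itself is omitted.

```python
import numpy as np


def mask_articulators(traj, masked_idx, fill_value=0.0):
    """Hide selected articulators in a trajectory (hypothetical helper).

    traj: array of shape (T, n_articulators, 2), assumed x/y pellet positions.
    masked_idx: indices of articulators to treat as mistracked.
    Returns the masked copy and a boolean mask of shape (T, n_articulators).
    """
    mask = np.zeros(traj.shape[:2], dtype=bool)
    mask[:, masked_idx] = True
    masked = traj.copy()
    masked[mask] = fill_value  # assumed placeholder for hidden positions
    return masked, mask


def masked_mse(pred, target, mask):
    """Mean squared error computed only over the hidden positions."""
    diff = (pred - target)[mask]
    return float(np.mean(diff ** 2))


# Synthetic stand-in data: 200 frames, 8 articulators, 2 coordinates each.
rng = np.random.default_rng(0)
traj = rng.normal(size=(200, 8, 2))

# Hide three of the eight articulators, as in the hardest case described above.
masked, mask = mask_articulators(traj, [1, 4, 6])

# A real model would map `masked` to a reconstruction; here the masked input
# itself serves as a trivial baseline prediction.
baseline_error = masked_mse(masked, traj, mask)
```

Scoring only the hidden positions keeps the error focused on what the model must infer from the remaining articulators, rather than on values it can simply copy from its input.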