SD · CL · AS · Apr 5, 2022

Repeat after me: Self-supervised learning of acoustic-to-articulatory mapping by vocal imitation

arXiv:2204.02269v1 · 7 citations · h-index: 36
AI Analysis

This work addresses speech production modeling for researchers in computational linguistics and AI, but it appears incremental as it builds on existing neural and self-supervised techniques.

The authors address the problem of learning an acoustic-to-articulatory mapping for speech production. They propose a self-supervised computational model that combines a pre-trained neural articulatory synthesizer with internal forward and inverse models, and they report encouraging performance in imitation simulations.

We propose a computational model of speech production combining a pre-trained neural articulatory synthesizer able to reproduce complex speech stimuli from a limited set of interpretable articulatory parameters, a DNN-based internal forward model predicting the sensory consequences of articulatory commands, and an internal inverse model based on a recurrent neural network recovering articulatory commands from the acoustic speech input. Both forward and inverse models are jointly trained in a self-supervised way from raw acoustic-only speech data from different speakers. The imitation simulations are evaluated objectively and subjectively and display quite encouraging performances.
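The three components described in the abstract can be wired together roughly as follows. This is only a minimal sketch of the data flow, not the authors' implementation: the feature sizes, the class names (`Synthesizer`, `ForwardModel`, `InverseModel`), the random untrained weights, and the two loss terms are all assumptions made for illustration. In particular, scoring the imitation through the forward model's prediction rather than through the synthesizer output is an assumed design, chosen here because a forward model is differentiable while a synthesizer may not be. No training step is shown.

```python
import math
import random

# Hypothetical sizes, chosen only to keep the example small.
N_ACOUSTIC = 4  # acoustic features per frame
N_ARTIC = 2     # interpretable articulatory parameters per frame
N_HIDDEN = 3    # hidden units in the recurrent inverse model


def rand_matrix(rows, cols, rng, scale=0.5):
    """Random weight matrix stored as a list of rows."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Matrix-vector product on plain lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def mse(a, b):
    """Mean squared error between two equal-length sequences of frames."""
    diffs = [(x - y) ** 2 for fa, fb in zip(a, b) for x, y in zip(fa, fb)]
    return sum(diffs) / len(diffs)


class Synthesizer:
    """Stand-in for the pre-trained articulatory synthesizer (kept frozen)."""

    def __init__(self, rng):
        self.w = rand_matrix(N_ACOUSTIC, N_ARTIC, rng)

    def __call__(self, artic_frames):
        return [[math.tanh(v) for v in matvec(self.w, a)] for a in artic_frames]


class ForwardModel:
    """Predicts the acoustic consequences of articulatory commands."""

    def __init__(self, rng):
        self.w = rand_matrix(N_ACOUSTIC, N_ARTIC, rng)

    def __call__(self, artic_frames):
        return [matvec(self.w, a) for a in artic_frames]


class InverseModel:
    """Simple recurrent network: acoustic frames -> articulatory commands."""

    def __init__(self, rng):
        self.w_in = rand_matrix(N_HIDDEN, N_ACOUSTIC, rng)
        self.w_rec = rand_matrix(N_HIDDEN, N_HIDDEN, rng)
        self.w_out = rand_matrix(N_ARTIC, N_HIDDEN, rng)

    def __call__(self, acoustic_frames):
        h = [0.0] * N_HIDDEN
        commands = []
        for x in acoustic_frames:
            pre = [a + b for a, b in zip(matvec(self.w_in, x), matvec(self.w_rec, h))]
            h = [math.tanh(p) for p in pre]  # hidden state carries context over time
            commands.append([math.tanh(v) for v in matvec(self.w_out, h)])
        return commands


def imitation_losses(speech, synth, fwd, inv):
    """One pass of the imitation loop on acoustic-only input."""
    artic = inv(speech)      # recover articulatory commands from sound
    predicted = fwd(artic)   # internal prediction of the resulting sound
    produced = synth(artic)  # sound the frozen synthesizer produces
    return {
        "artic": artic,
        "forward_loss": mse(predicted, produced),  # forward model vs. synthesizer
        "imitation_loss": mse(predicted, speech),  # predicted sound vs. heard sound
    }


rng = random.Random(0)
speech = [[rng.uniform(-1, 1) for _ in range(N_ACOUSTIC)] for _ in range(5)]
result = imitation_losses(speech, Synthesizer(rng), ForwardModel(rng), InverseModel(rng))
print(result["forward_loss"], result["imitation_loss"])
```

In a trainable version, both loss terms would be minimized jointly over the forward and inverse weights while the synthesizer stays fixed, which matches the abstract's description of joint self-supervised training from acoustic-only data.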

