SD CV AS

Apr 2, 2022

Acoustic-to-articulatory Inversion based on Speech Decomposition and Auxiliary Feature

Jianrong Wang, Jinyu Liu, Longxuan Zhao, Shanyu Wang, Ruiguo Yu, Li Liu

arXiv:2204.00873v18.314 citations

h-index: 11

Originality: Incremental advance

AI Analysis

This work addresses the problems of limited data and performance bottlenecks in speech processing, with applications such as speech therapy and human-computer interaction, though the advance is incremental.

The paper tackles speaker-independent acoustic-to-articulatory inversion by proposing a method that uses speech decomposition and auxiliary lip features, reducing the average RMSE by 0.29 and increasing the average correlation coefficient by 5.0% in the speaker-independent case.

Acoustic-to-articulatory inversion (AAI) aims to recover the movement of articulators from speech signals. Achieving speaker-independent AAI remains a challenge given the limited data. Moreover, most current works use only audio speech as input, which causes an inevitable performance bottleneck. To address these problems, we first pre-train a speech decomposition network that decomposes audio speech into a speaker embedding and a content embedding, which serve as new personalized speech features adapted to the speaker-independent case. Second, to further improve AAI, we propose a novel auxiliary feature network that estimates lip auxiliary features from these personalized speech features. Experimental results on three public datasets show that, compared with the state-of-the-art method that uses only the audio speech feature, the proposed method reduces the average RMSE by 0.25 and increases the average correlation coefficient by 2.0% in the speaker-dependent case. More importantly, the average RMSE decreases by 0.29 and the average correlation coefficient increases by 5.0% in the speaker-independent case.
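The two metrics quoted in the abstract, RMSE and the correlation coefficient, can be illustrated with a minimal sketch. It assumes each articulator channel is a plain list of floats and that "average" means an unweighted mean over channels; the abstract does not state the averaging procedure or the units, so the helper names `rmse`, `pearson`, and `average_over_channels` are hypothetical and not taken from the paper.

```python
import math


def rmse(pred, target):
    """Root-mean-square error between two equal-length trajectories."""
    if len(pred) != len(target) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = sum((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(squared / len(pred))


def pearson(pred, target):
    """Pearson correlation coefficient between two equal-length trajectories."""
    if len(pred) != len(target) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for a constant trajectory")
    return cov / math.sqrt(var_p * var_t)


def average_over_channels(metric, pred_channels, target_channels):
    """Unweighted mean of a metric over articulator channels (an assumption)."""
    if len(pred_channels) != len(target_channels) or not pred_channels:
        raise ValueError("channel lists must be non-empty and of equal length")
    scores = [metric(p, t) for p, t in zip(pred_channels, target_channels)]
    return sum(scores) / len(scores)
```

A lower average RMSE and a higher average correlation coefficient both indicate predicted articulator trajectories that track the measured ones more closely, which is the direction of the improvements the abstract reports.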
