SD CL AS
Apr 25, 2025

Tracking Articulatory Dynamics in Speech with a Fixed-Weight BiLSTM-CNN Architecture

arXiv:2504.18099v1
h-index: 3
Originality: Incremental advance
AI Analysis

This is an incremental improvement for speech production research, aiming to develop more robust and efficient models for articulatory feature prediction.

The paper tackles the prediction of tongue and lip articulatory features from speech acoustics using a BiLSTM-CNN architecture with fixed weight initialization, which outperformed adaptive weight initialization while requiring fewer training epochs.

Speech production is a complex sequential process that involves the coordination of various articulatory features. Among the articulators, the tongue is a highly versatile active articulator responsible for shaping airflow to produce targeted speech sounds that are intelligible, clear, and distinct. This paper presents a novel approach for predicting the tongue and lip articulatory features underlying given speech acoustics, using a stacked Bidirectional Long Short-Term Memory (BiLSTM) architecture combined with a one-dimensional Convolutional Neural Network (CNN) for post-processing, with fixed weights initialization. The proposed network is trained on two datasets of simultaneously recorded speech and Electromagnetic Articulography (EMA) data, which differ in geographical origin, linguistic characteristics, phonetic diversity, and recording equipment. The performance of the model is assessed in Speaker Dependent (SD), Speaker Independent (SI), Corpus Dependent (CD), and Cross Corpus (CC) modes. Experimental results indicate that the proposed model with the fixed weights approach outperformed adaptive weights initialization within a relatively small number of training epochs. These findings contribute to the development of robust and efficient models for articulatory feature prediction, paving the way for advancements in speech production research and applications.
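The four evaluation modes named in the abstract can be pictured as different ways of partitioning the recordings into training and test sets. The sketch below is only an illustration under stated assumptions: the abstract does not define the modes, so the usual readings are assumed (SD: train and test on the same speaker; SI: test on speakers unseen in training; CD: train and test within one corpus; CC: train on one corpus and test on the other). The `Utterance` record, the `make_split` function, the corpus and speaker labels, and the "every fifth utterance" hold-out rule are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    corpus: str   # hypothetical corpus label, e.g. "A" or "B"
    speaker: str  # hypothetical speaker ID, unique across corpora
    index: int    # utterance number within the speaker's session


def make_split(utts, mode, *, speaker=None, held_out=frozenset(),
               corpus=None, test_corpus=None, test_every=5):
    """Return (train, test) lists for one evaluation mode.

    Assumed readings of the modes (not spelled out in the abstract):
      SD: one speaker; a fixed subset of their utterances is held out.
      SI: speakers listed in `held_out` appear only in the test set.
      CD: one corpus; a fixed subset of its utterances is held out.
      CC: train on `corpus`, test on `test_corpus`.
    """
    def is_test_slot(u):
        # Illustrative hold-out rule: every `test_every`-th utterance.
        return u.index % test_every == 0

    if mode == "SD":
        pool = [u for u in utts if u.speaker == speaker]
        train = [u for u in pool if not is_test_slot(u)]
        test = [u for u in pool if is_test_slot(u)]
    elif mode == "SI":
        train = [u for u in utts if u.speaker not in held_out]
        test = [u for u in utts if u.speaker in held_out]
    elif mode == "CD":
        pool = [u for u in utts if u.corpus == corpus]
        train = [u for u in pool if not is_test_slot(u)]
        test = [u for u in pool if is_test_slot(u)]
    elif mode == "CC":
        train = [u for u in utts if u.corpus == corpus]
        test = [u for u in utts if u.corpus == test_corpus]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return train, test


# Toy data: two corpora, two speakers each, ten utterances per speaker.
utts = [
    Utterance(corpus=c, speaker=f"{c}_s{s}", index=i)
    for c in ("A", "B")
    for s in (1, 2)
    for i in range(10)
]

sd_train, sd_test = make_split(utts, "SD", speaker="A_s1")
si_train, si_test = make_split(utts, "SI", held_out={"A_s2", "B_s2"})
cd_train, cd_test = make_split(utts, "CD", corpus="A")
cc_train, cc_test = make_split(utts, "CC", corpus="A", test_corpus="B")
```

Under these assumptions, SI and CC are the harder settings, because the test set contains speakers (and, for CC, recording conditions) the model has never seen during training.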
