Articulation-Informed ASR: Integrating Articulatory Features into ASR via Auxiliary Speech Inversion and Cross-Attention Fusion
This work aims to improve ASR accuracy, particularly in low-resource scenarios, by reintroducing articulatory features into modern architectures; the contribution is incremental, since it extends earlier approaches that relied on shallow acoustic models.
The paper integrates articulatory features, which have been underutilized in deep learning models, into automatic speech recognition (ASR) and reports consistent improvements over transformer-based baselines on LibriSpeech, especially under low-resource conditions.
Prior work has investigated the use of articulatory features as complementary representations for automatic speech recognition (ASR), but their use was largely confined to shallow acoustic models. In this work, we revisit articulatory information in the era of deep learning and propose a framework that leverages articulatory representations both as an auxiliary task and as a pseudo-input to the recognition model. Specifically, we employ speech inversion as an auxiliary prediction task, and the predicted articulatory features are injected into the model as a query stream in a cross-attention module with acoustic embeddings as keys and values. Experiments on LibriSpeech demonstrate that our approach yields consistent improvements over strong transformer-based baselines, particularly under low-resource conditions. These findings suggest that articulatory features, once sidelined in ASR research, can provide meaningful benefits when reintroduced with modern architectures.
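The fusion step described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in the paper: the linear speech-inversion head, the single attention head, the mean-squared-error auxiliary loss, the function names, and all dimensions (50 frames, 16-dimensional acoustic embeddings, 6 articulatory channels) are assumptions chosen only to make the data flow concrete. Predicted articulatory features form the queries, and acoustic embeddings supply the keys and values.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def speech_inversion_head(acoustic, w_si):
    """Hypothetical linear head: acoustic (T, d_model) -> articulatory (T, n_art)."""
    return acoustic @ w_si


def cross_attention_fusion(artic, acoustic, w_q, w_k, w_v):
    """Single-head cross-attention: articulatory queries, acoustic keys/values."""
    q = artic @ w_q                           # (T, d_k) queries from articulatory stream
    k = acoustic @ w_k                        # (T, d_k) keys from acoustic embeddings
    v = acoustic @ w_v                        # (T, d_model) values from acoustic embeddings
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, T) scaled dot-product scores
    weights = softmax(scores, axis=-1)        # each row sums to 1 over acoustic frames
    return weights @ v, weights


def auxiliary_loss(pred, target):
    """Assumed auxiliary objective: mean squared error on articulatory trajectories."""
    return float(np.mean((pred - target) ** 2))


# Toy dimensions, all hypothetical.
rng = np.random.default_rng(0)
T, d_model, n_art, d_k = 50, 16, 6, 8

acoustic = rng.standard_normal((T, d_model))        # stand-in for encoder outputs
w_si = 0.1 * rng.standard_normal((d_model, n_art))
w_q = 0.1 * rng.standard_normal((n_art, d_k))
w_k = 0.1 * rng.standard_normal((d_model, d_k))
w_v = 0.1 * rng.standard_normal((d_model, d_model))

artic_pred = speech_inversion_head(acoustic, w_si)  # auxiliary prediction
fused, weights = cross_attention_fusion(artic_pred, acoustic, w_q, w_k, w_v)

# Random targets stand in for reference articulatory trajectories.
artic_target = rng.standard_normal((T, n_art))
aux = auxiliary_loss(artic_pred, artic_target)
```

In a trained system, `aux` would be combined with the recognition loss under some weighting (also an assumption here), and `fused` would be passed to the remaining recognition layers. Because the queries are predicted from the audio itself, no measured articulatory data is needed at inference time, which is what allows the articulatory stream to act as a pseudo-input.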