SD AI AS · Jan 15, 2025

A Non-autoregressive Model for Joint STT and TTS

Vishal Sunder, Brian Kingsbury, George Saon, Samuel Thomas, Slava Shechtman, Hagai Aronowitz, Eric Fosler-Lussier, Luis Lastras

arXiv:2501.09104v24.02 citations · h-index: 40 · ICASSP

Originality: Incremental advance

AI Analysis

This work addresses the challenge of integrating STT and TTS for more efficient speech processing systems, though it appears incremental as it builds on existing non-autoregressive and multimodal approaches.

The paper tackles joint modeling of automatic speech recognition (STT) and speech synthesis (TTS) with a non-autoregressive multimodal framework that accepts speech and text inputs individually or together. It reports that the joint model outperforms the STT-specific baseline in all tasks and performs competitively with the TTS-specific baseline across a wide range of evaluation metrics.

In this paper, we take a step towards jointly modeling automatic speech recognition (STT) and speech synthesis (TTS) in a fully non-autoregressive way. We develop a novel multimodal framework capable of handling the speech and text modalities as input either individually or together. The proposed model can also be trained with unpaired speech or text data owing to its multimodal nature. We further propose an iterative refinement strategy to improve the STT and TTS performance of our model such that the partial hypothesis at the output can be fed back to the input of our model, thus iteratively improving both STT and TTS predictions. We show that our joint model can effectively perform both STT and TTS tasks, outperforming the STT-specific baseline in all tasks and performing competitively with the TTS-specific baseline across a wide range of evaluation metrics.
