SD | AI | AS | Jan 15, 2025

A Non-autoregressive Model for Joint STT and TTS

arXiv:2501.09104v2 | 2 citations | h-index: 40 | ICASSP
AI Analysis

This work addresses the challenge of integrating STT and TTS for more efficient speech processing systems, though it appears incremental as it builds on existing non-autoregressive and multimodal approaches.

The paper jointly models automatic speech recognition (STT) and speech synthesis (TTS) with a non-autoregressive multimodal framework that accepts speech and text inputs individually or together. The joint model outperforms the STT-specific baseline and performs competitively with the TTS-specific baseline across a range of evaluation metrics.

In this paper, we take a step towards jointly modeling automatic speech recognition (STT) and speech synthesis (TTS) in a fully non-autoregressive way. We develop a novel multimodal framework capable of handling the speech and text modalities as input either individually or together. The proposed model can also be trained with unpaired speech or text data owing to its multimodal nature. We further propose an iterative refinement strategy to improve the STT and TTS performance of our model such that the partial hypothesis at the output can be fed back to the input of our model, thus iteratively improving both STT and TTS predictions. We show that our joint model can effectively perform both STT and TTS tasks, outperforming the STT-specific baseline in all tasks and performing competitively with the TTS-specific baseline across a wide range of evaluation metrics.
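The iterative refinement strategy described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `iterative_refinement`, `toy_model`, `VOCAB`, and the choice to keep the observed modality fixed while feeding the hypothesis into the missing slot are all assumptions made for illustration. The toy model is a hypothetical stand-in that simply resolves one more token per pass; it only shows how a partial hypothesis at the output can be fed back to the input.

```python
from typing import Callable, List, Optional, Tuple

Speech = List[int]   # hypothetical stand-in for acoustic frames
Text = List[str]
JointModel = Callable[[Optional[Speech], Optional[Text]], Tuple[Speech, Text]]

VOCAB = {0: "a", 1: "b", 2: "c"}  # toy frame-to-token mapping (assumption)


def toy_model(speech: Optional[Speech], text: Optional[Text]) -> Tuple[Speech, Text]:
    """Hypothetical joint model: returns (speech hypothesis, text hypothesis)."""
    if speech is None:
        # TTS direction: map each token back to its toy "frame".
        inverse = {tok: frame for frame, tok in VOCAB.items()}
        return [inverse[t] for t in text], list(text)
    # STT direction: each pass resolves one more token than the fed-back hypothesis.
    known = 0 if text is None else sum(t != "<unk>" for t in text)
    n = min(known + 1, len(speech))
    hyp = [VOCAB[f] for f in speech[:n]] + ["<unk>"] * (len(speech) - n)
    return list(speech), hyp


def iterative_refinement(
    model: JointModel,
    speech: Optional[Speech],
    text: Optional[Text],
    num_iters: int = 3,
) -> Tuple[Speech, Text]:
    """Run the model once, then feed its hypotheses back as input."""
    if num_iters < 1:
        raise ValueError("num_iters must be at least 1")
    speech_hyp, text_hyp = model(speech, text)
    for _ in range(num_iters - 1):
        # Assumption: the observed modality stays fixed; only the missing
        # modality is replaced by the previous pass's hypothesis.
        speech_in = speech if speech is not None else speech_hyp
        text_in = text if text is not None else text_hyp
        speech_hyp, text_hyp = model(speech_in, text_in)
    return speech_hyp, text_hyp
```

Calling `iterative_refinement(toy_model, [0, 1, 2], None, num_iters=1)` leaves part of the transcript unresolved, while more passes let the toy model fill in the remaining tokens. A real system would replace `toy_model` with the trained non-autoregressive network.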

