CL AI SD AS
Jun 11, 2024

A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation

arXiv:2406.06937v2
21 citations
Originality: Highly original
AI Analysis

This work addresses the need for real-time, low-latency communication tools by overcoming the error propagation and accumulated delays of cascade methods, though it is an incremental improvement in simultaneous translation frameworks.

The paper tackles simultaneous speech-to-speech translation by proposing a non-autoregressive generation framework that integrates speech-to-text and speech-to-speech tasks into a unified end-to-end system. It achieves high-quality interpretation with a delay of less than 3 seconds and a 28-fold decoding speedup in offline generation.

Simultaneous translation models play a crucial role in facilitating communication. However, existing research primarily focuses on text-to-text or speech-to-text models, necessitating additional cascade components to achieve speech-to-speech translation. These pipeline methods suffer from error propagation and accumulate delays in each cascade component, resulting in reduced synchronization between the speaker and listener. To overcome these challenges, we propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X), which integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework. We develop a non-autoregressive decoder capable of concurrently generating multiple text or acoustic unit tokens upon receiving fixed-length speech chunks. The decoder can generate blank or repeated tokens and employ CTC decoding to dynamically adjust its latency. Experimental results show that NAST-S2X outperforms state-of-the-art models in both speech-to-text and speech-to-speech tasks. It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
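
The abstract says the decoder may emit blank or repeated tokens for each speech chunk and relies on CTC decoding to adjust latency. The sketch below illustrates only the standard CTC collapse rule (merge consecutive repeats, then drop blanks) applied chunk by chunk. It is not the NAST-S2X implementation: the blank id, the function names `ctc_collapse` and `stream_decode`, and the choice to merge repeats across chunk boundaries are assumptions made for illustration.

```python
from typing import Iterable, List, Optional, Tuple

BLANK = 0  # assumed id of the CTC blank token (hypothetical choice)


def ctc_collapse(
    tokens: Iterable[int],
    prev: Optional[int] = None,
    blank: int = BLANK,
) -> Tuple[List[int], Optional[int]]:
    """Merge consecutive repeated tokens, then drop blanks.

    `prev` is the last raw token of the previous chunk, so a repeat that
    spans a chunk boundary is still merged (an assumption of this sketch).
    Returns the newly emitted tokens and the last raw token seen.
    """
    out: List[int] = []
    for tok in tokens:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out, prev


def stream_decode(chunks: Iterable[Iterable[int]]) -> List[List[int]]:
    """Collapse each chunk's raw decoder output as it arrives."""
    prev: Optional[int] = None
    emitted: List[List[int]] = []
    for chunk in chunks:
        new_tokens, prev = ctc_collapse(chunk, prev)
        emitted.append(new_tokens)
    return emitted


if __name__ == "__main__":
    # Three chunks of raw decoder output; the last one is all blanks.
    raw = [[0, 0, 5, 5], [5, 0, 5, 7], [0, 0, 0, 0]]
    print(stream_decode(raw))
```

In this toy run the all-blank chunk emits nothing, which is how blank predictions let a decoder effectively wait for more input, while a chunk with several distinct non-blank tokens emits several outputs at once.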

Code Implementations: 1 repo