CL · SD · AS · Jun 11, 2021

Direct Simultaneous Speech-to-Text Translation Assisted by Synchronized Streaming ASR

arXiv:2106.06636v1717 citations
Originality Highly original
AI Analysis

This paper addresses the challenge of low-latency, high-quality speech translation for real-time applications, offering a novel hybrid approach that improves over existing cascaded and end-to-end methods.

The paper tackles the problem of simultaneous speech-to-text translation by proposing a new paradigm that synchronizes a streaming ASR decoder with a direct translation decoder. The ASR results guide the translation decoding policy but are not fed to the translator as input, which avoids propagating ASR errors. Experiments on the MuST-C dataset for En-to-De and En-to-Es show substantially better translation quality at similar latency levels.

Simultaneous speech-to-text translation is widely useful in many scenarios. The conventional cascaded approach uses a pipeline of streaming ASR followed by simultaneous MT, but suffers from error propagation and extra latency. To alleviate these issues, recent efforts attempt to directly translate the source speech into target text simultaneously, but this is much harder due to the combination of two separate tasks. We instead propose a new paradigm with the advantages of both cascaded and end-to-end approaches. The key idea is to use two separate, but synchronized, decoders on streaming ASR and direct speech-to-text translation (ST), respectively, and the intermediate results of ASR guide the decoding policy of (but are not fed as input to) ST. During training time, we use multitask learning to jointly learn these two tasks with a shared encoder. En-to-De and En-to-Es experiments on the MuST-C dataset demonstrate that our proposed technique achieves substantially better translation quality at similar levels of latency.
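To make the idea of ASR "guiding the decoding policy" concrete, here is a minimal sketch in Python. It assumes a wait-k style rule, which the text above does not specify: the ST decoder may emit its next target token only once the streaming ASR decoder has committed at least k more source words than the number of target tokens already emitted. The function name `asr_guided_wait_k`, the parameter `k`, and the pre-computed `st_tokens` list are all hypothetical stand-ins, not the authors' implementation. Note that only the ASR word count is consulted; the ASR text itself is never passed to the translator.

```python
from typing import Iterable, Iterator, List, Tuple


def asr_guided_wait_k(
    asr_word_counts: Iterable[int],
    st_tokens: List[str],
    k: int = 3,
) -> Iterator[Tuple[int, str]]:
    """Yield (chunk_index, target_token) pairs under an ASR-driven wait-k rule.

    asr_word_counts[t] is the number of source words the streaming ASR
    decoder has committed after consuming speech chunk t (hypothetical input).
    st_tokens stands in for the ST decoder's output; a real system would
    query the ST decoder over the shared encoder states instead.
    """
    emitted = 0
    step = -1  # stays -1 if the speech stream is empty
    for step, n_words in enumerate(asr_word_counts):
        # WRITE while the ASR word count is at least k ahead of the
        # number of target tokens emitted so far; otherwise READ more speech.
        while emitted < len(st_tokens) and n_words >= emitted + k:
            yield step, st_tokens[emitted]
            emitted += 1
    # Speech has ended: flush whatever the translator still has to say.
    for token in st_tokens[emitted:]:
        yield step, token


if __name__ == "__main__":
    counts = [0, 1, 2, 3, 3, 4, 5]  # ASR words committed per speech chunk
    tokens = ["a", "b", "c", "d", "e"]  # placeholder translation tokens
    for chunk, token in asr_guided_wait_k(counts, tokens, k=3):
        print(chunk, token)
```

In this toy run the first target token appears only at chunk 3, when the ASR decoder has committed three words, and the last two tokens are flushed once the speech ends. A larger `k` trades latency for more source context per target token.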

