CLAI Oct 20, 2020

Fluent and Low-latency Simultaneous Speech-to-Speech Translation with Self-adaptive Training

arXiv:2010.10048v21003 citations
Originality: Highly original
AI Analysis

This paper addresses real-time, fluent speech translation for users who need low-latency communication. Its novelty is that it moves beyond the single-sentence scenario to a continuous stream of sentences.

The paper tackles simultaneous speech-to-speech translation across continuous sentences, where existing methods accumulate latency when the speaker talks faster and introduce unnatural pauses when the speaker talks slower. The proposed Self-Adaptive Translation (SAT) method adjusts the length of translations to match the source speech rate. At similar translation quality (measured by BLEU), it produces more fluent target speech (measured by the naturalness metric MOS) with substantially lower latency than the baseline, in both Zh <-> En directions.

Simultaneous speech-to-speech translation is widely useful but extremely challenging, since it needs to generate target-language speech concurrently with the source-language speech, with only a few seconds delay. In addition, it needs to continuously translate a stream of sentences, but all recent solutions merely focus on the single-sentence scenario. As a result, current approaches accumulate latencies progressively when the speaker talks faster, and introduce unnatural pauses when the speaker talks slower. To overcome these issues, we propose Self-Adaptive Translation (SAT) which flexibly adjusts the length of translations to accommodate different source speech rates. At similar levels of translation quality (as measured by BLEU), our method generates more fluent target speech (as measured by the naturalness metric MOS) with substantially lower latency than the baseline, in both Zh <-> En directions.
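The two failure modes described in the abstract, latency that accumulates when the speaker talks faster and pauses that appear when the speaker talks slower, can be illustrated with a toy timing model. The sketch below is not the paper's method or its evaluation setup. The function `simulate`, its parameters, and the fixed `wait` delay are hypothetical, and all durations are made-up numbers in seconds. It assumes that the target speech for each sentence can start no earlier than a fixed delay after the source sentence begins, and no earlier than the end of the previous target sentence.

```python
from typing import List, Tuple


def simulate(src_durs: List[float], tgt_durs: List[float],
             wait: float = 1.0) -> List[Tuple[float, float]]:
    """Toy timing model (an illustrative assumption, not the paper's system).

    src_durs[i]: how long the speaker takes to say sentence i.
    tgt_durs[i]: how long the synthesized translation of sentence i lasts.
    wait: fixed delay between the start of a source sentence and the
          earliest moment its translated speech may start.

    Returns one (lag, pause) pair per sentence:
      lag   = target start time minus source start time
      pause = silence between the previous target sentence and this one
    """
    src_start = 0.0   # when the speaker starts the current sentence
    tgt_free = 0.0    # when the target audio channel becomes free
    rows: List[Tuple[float, float]] = []
    for src_dur, tgt_dur in zip(src_durs, tgt_durs):
        earliest = src_start + wait
        # Target speech must wait for both the source and the previous output.
        start = max(earliest, tgt_free)
        lag = start - src_start
        # No pause is counted before the very first sentence.
        pause = max(0.0, earliest - tgt_free) if rows else 0.0
        rows.append((lag, pause))
        tgt_free = start + tgt_dur
        src_start += src_dur
    return rows


# Fast speaker: each source sentence (2 s) is shorter than its translation (3 s).
fast = simulate([2.0, 2.0, 2.0], [3.0, 3.0, 3.0])
# Slow speaker: each source sentence (4 s) is longer than its translation (3 s).
slow = simulate([4.0, 4.0, 4.0], [3.0, 3.0, 3.0])
# Target durations matched to the fast speaker's source durations.
matched = simulate([2.0, 2.0, 2.0], [2.0, 2.0, 2.0])

print("fast lags:", [lag for lag, _ in fast])
print("slow pauses:", [pause for _, pause in slow])
print("matched lags:", [lag for lag, _ in matched])
```

In this toy model the lag grows by one second per sentence for the fast speaker, a one-second silence appears between sentences for the slow speaker, and the lag stays constant once target durations match source durations. This is the intuition behind adjusting translation length to the source speech rate. How SAT actually chooses that length is described in the paper itself and is not modeled here.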
