CL | SD | AS | Nov 13, 2024

Direct Speech-to-Speech Neural Machine Translation: A Survey

arXiv:2411.14453v1 | 7 citations | h-index: 4
Originality: Synthesis-oriented
AI Analysis

This is an incremental survey of speech-to-speech translation, a technology for bridging the communication gap among communities, aimed at both beginners and advanced researchers.

This survey tackles the lack of comprehensive reviews on direct speech-to-speech translation (S2ST) models, which aim to translate speech without intermediate text, and it provides a critical analysis of their performance on benchmark datasets, noting they still lag behind cascade models in real-world translation.

Speech-to-Speech Translation (S2ST) models transform speech in one language into speech in another (target) language while preserving the linguistic information. S2ST is important for bridging the communication gap among communities and has diverse applications. In recent years, researchers have introduced direct S2ST models, which have the potential to translate speech without relying on intermediate text generation, offer better decoding latency, and preserve paralinguistic and non-linguistic features. However, direct S2ST has yet to achieve the quality needed for seamless communication and still lags behind cascade models in performance, especially in real-world translation. To the best of our knowledge, no comprehensive survey of direct S2ST systems is available that beginners and advanced researchers can consult for a quick overview. The present work provides a comprehensive review of direct S2ST models, data and application issues, and performance metrics. We critically analyze the models' performance on benchmark datasets and outline research challenges and future directions.

