CL · SD · AS · Dec 12, 2022

Direct Speech-to-speech Translation without Textual Annotation using Bottleneck Features

arXiv:2212.05805v1 · 1 citations · h-index: 21
Originality: Incremental advance
AI Analysis

The approach allows speech-to-speech translation models to be trained without textual data, which reduces annotation costs for applications such as simultaneous interpretation.

The paper addresses speech-to-speech translation without textual annotation by using bottleneck features as intermediate training objectives. On Mandarin-Cantonese tasks, the authors report performance that can match a cascaded system in translation and synthesis quality.

Speech-to-speech translation directly translates a speech utterance in one language into speech in another, and has great potential in tasks such as simultaneous interpretation. State-of-the-art models usually contain an auxiliary module for phoneme sequence prediction, which requires textual annotation of the training dataset. We propose a direct speech-to-speech translation model that can be trained without any textual annotation or content information. Instead of introducing an auxiliary phoneme prediction task, we use bottleneck features as intermediate training objectives to ensure the translation performance of the system. Experiments on Mandarin-Cantonese speech translation demonstrate the feasibility of the proposed approach, and its performance can match that of a cascaded system with respect to translation and synthesis quality.
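
As a rough illustration of what "bottleneck features as intermediate training objectives" could look like, the sketch below combines a loss on the final spectrogram with a loss on an intermediate bottleneck-feature prediction. The abstract does not specify the loss functions, the weighting, or the model architecture, so the mean-squared-error choice, the `bottleneck_weight` parameter, and the function names are all assumptions made for illustration only.

```python
from typing import Sequence

Frame = Sequence[float]


def mean_squared_error(pred: Sequence[Frame], target: Sequence[Frame]) -> float:
    """Average squared difference over all frames and feature dimensions."""
    if len(pred) != len(target):
        raise ValueError("prediction and target must have the same number of frames")
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        if len(p_frame) != len(t_frame):
            raise ValueError("frames must have the same feature dimension")
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("inputs must contain at least one value")
    return total / count


def combined_loss(
    pred_bottleneck: Sequence[Frame],
    target_bottleneck: Sequence[Frame],
    pred_spectrogram: Sequence[Frame],
    target_spectrogram: Sequence[Frame],
    bottleneck_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Final synthesis loss plus a weighted intermediate bottleneck-feature loss.

    The intermediate term takes the place of an auxiliary phoneme-prediction
    loss, so no textual annotation is needed to compute it.
    """
    bottleneck_term = mean_squared_error(pred_bottleneck, target_bottleneck)
    spectrogram_term = mean_squared_error(pred_spectrogram, target_spectrogram)
    return spectrogram_term + bottleneck_weight * bottleneck_term
```

In this sketch the bottleneck targets would come from the target-language speech itself (for example, from a separately trained feature extractor), which is what removes the dependence on transcripts; how the paper actually obtains them is not stated in the abstract.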
