CL · AI · SD · AS · Sep 20, 2023

Long-Form End-to-End Speech Translation via Latent Alignment Segmentation

arXiv:2309.11384v1 · 7 citations · h-index: 48
Originality: Highly original
AI Analysis

The proposed method enables low-latency, end-to-end simultaneous speech translation in real-world settings where sentence segmentation is unavailable, addressing a key bottleneck in the field.

The paper tackles simultaneous speech translation of long-form audio when no oracle sentence segmentation is available. It proposes a segmentation approach that reuses the existing speech translation encoder-decoder architecture with ST CTC, so that one model performs segmentation and translation at the same time, without supervision or extra parameters. The authors report state-of-the-art quality across diverse language pairs and on both in-domain and out-of-domain data, at no additional computational cost.

Current simultaneous speech translation models can process audio only up to a few seconds long. Contemporary datasets provide an oracle segmentation into sentences based on human-annotated transcripts and translations. However, segmentation into sentences is not available in the real world. Current speech segmentation approaches either offer poor segmentation quality or have to trade latency for quality. In this paper, we propose a novel segmentation approach for low-latency end-to-end speech translation. We leverage the existing speech translation encoder-decoder architecture with ST CTC and show that it can perform the segmentation task without supervision or additional parameters. To the best of our knowledge, our method is the first to allow truly end-to-end simultaneous speech translation, as the same model is used for translation and segmentation at the same time. On a diverse set of language pairs and on in- and out-of-domain data, we show that the proposed approach achieves state-of-the-art quality at no additional computational cost.

