CL · Oct 24, 2022

Don't Discard Fixed-Window Audio Segmentation in Speech-to-Text Translation

arXiv:2210.13363v1293 citations · h-index: 45
Originality: Synthesis-oriented
AI Analysis

This work addresses segmentation challenges for real-life online spoken language translation applications, but it is incremental as it compares existing methods rather than introducing a new paradigm.

The paper tackles the problem of audio segmentation for speech-to-text translation in continuous audio, showing that a simple fixed-window segmentation method can perform well under certain conditions, with results reported on translation quality, flicker, and delay across five language pairs.

For real-life applications, it is crucial that end-to-end spoken language translation models perform well on continuous audio, without relying on human-supplied segmentation. For online spoken language translation, where models need to start translating before the full utterance is spoken, most previous work has ignored the segmentation problem. In this paper, we compare various methods for improving models' robustness towards segmentation errors and different segmentation strategies in both offline and online settings and report results on translation quality, flicker and delay. Our findings on five different language pairs show that a simple fixed-window audio segmentation can perform surprisingly well given the right conditions.
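To make the idea of fixed-window segmentation concrete, here is a minimal sketch of how continuous audio could be cut into equal-length windows. It is not the authors' implementation: the function name `fixed_window_segments`, the window length, the optional overlap, and the 16 kHz sample rate in the usage note are all illustrative assumptions, since the abstract does not specify them.

```python
from typing import List, Tuple


def fixed_window_segments(
    num_samples: int,
    sample_rate: int,
    window_s: float,
    overlap_s: float = 0.0,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of fixed-length windows.

    Hypothetical helper for illustration only. The last window is
    truncated so that it never runs past the end of the audio.
    """
    if num_samples < 0 or sample_rate <= 0:
        raise ValueError("num_samples must be >= 0 and sample_rate > 0")

    window = int(round(window_s * sample_rate))
    step = window - int(round(overlap_s * sample_rate))
    if window <= 0 or step <= 0:
        raise ValueError("window must be positive and longer than the overlap")

    segments: List[Tuple[int, int]] = []
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        segments.append((start, end))
        if end == num_samples:
            break  # the audio is fully covered
        start += step
    return segments
```

For example, `fixed_window_segments(160_000, 16_000, window_s=4.0)` splits ten seconds of 16 kHz audio into two full four-second windows and one two-second remainder. Each window would then be passed to the translation model independently, with no reliance on human-supplied segment boundaries.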

Code Implementations: 1 repo
