CLAS | Sep 15, 2021

Learning When to Translate for Streaming Speech

arXiv:2109.07368v4642 citations | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses real-time speech translation for applications that require low latency. The contribution is incremental: it builds on existing encoder-decoder models and adds a novel segmentation approach.

The paper tackles the problem of deciding when to generate partial translations for streaming speech. Existing methods wait for a fixed duration before translating, which often splits acoustic units because their boundaries are not evenly spaced. The paper proposes MoSST, a method that uses a monotonic segmentation module to detect proper speech unit boundaries, and reports the best trade-off between translation quality (BLEU) and latency on the MuST-C dataset.

How can we find the proper moments to generate partial sentence translations given a streaming speech input? Existing approaches that wait and translate for a fixed duration often break the acoustic units in speech, since the boundaries between acoustic units are not evenly spaced. In this paper, we propose MoSST, a simple yet effective method for translating streaming speech content. Given a usually long speech sequence, we develop an efficient monotonic segmentation module inside an encoder-decoder model to accumulate acoustic information incrementally and detect proper speech unit boundaries for the input in the speech translation task. Experiments on multiple translation directions of the MuST-C dataset show that MoSST outperforms existing methods and achieves the best trade-off between translation quality (BLEU) and latency. Our code is available at https://github.com/dqqcasia/mosst.
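
The contrast between fixed-duration waiting and incremental accumulation can be illustrated with a small sketch. This is not the MoSST implementation: it assumes a simple accumulate-and-threshold rule, and the names `fixed_boundaries`, `accumulate_boundaries`, the per-frame `weights`, and the `threshold` value are all hypothetical. In a real model the weights would be predicted from encoder states; here they are hand-picked numbers.

```python
from typing import Iterable, List


def fixed_boundaries(num_frames: int, step: int) -> List[int]:
    """Baseline: place a boundary after every `step` frames, ignoring content."""
    return list(range(step - 1, num_frames, step))


def accumulate_boundaries(weights: Iterable[float], threshold: float = 1.0) -> List[int]:
    """Hypothetical monotonic segmenter.

    Accumulates a per-frame weight (assumed to lie in [0, 1]) from left to
    right and emits a boundary at the frame where the running sum reaches
    `threshold`. The excess is carried into the next segment, so the scan
    never looks back at earlier frames.
    """
    boundaries: List[int] = []
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if acc >= threshold:
            boundaries.append(i)
            acc -= threshold  # carry the remainder forward
    return boundaries


# Hand-picked illustrative weights, not real model output.
weights = [0.25, 0.25, 0.75, 0.125, 0.125, 0.875, 0.5, 0.5]
adaptive = accumulate_boundaries(weights)
fixed = fixed_boundaries(len(weights), step=3)
```

With these weights the adaptive boundaries fall where the accumulated evidence is high, so segment lengths vary, whereas the fixed-step baseline cuts every three frames regardless of content. A streaming decoder would be triggered at each emitted boundary.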

Code Implementations: 1 repo
