SD, AI, CL, AS
Jun 10, 2024

StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection

arXiv:2406.06097v1
33 citations
Originality: Incremental advance
AI Analysis

This work addresses the real-world need for real-time speech translation in streaming scenarios. It is an incremental step: it adapts existing simultaneous translation concepts to a new, more challenging setting.

The paper tackles streaming speech-to-text translation (StreamST), where audio arrives as a continuous, unbounded stream and only part of the past history can be retained. It introduces StreamAtt, the first StreamST policy, and StreamLAAL, a new latency metric. Experiments on all 8 languages of MuST-C v1.0 show the effectiveness of StreamAtt compared to a naive streaming baseline and the related state-of-the-art SimulST policy.

Streaming speech-to-text translation (StreamST) is the task of automatically translating speech while incrementally receiving an audio stream. Unlike simultaneous ST (SimulST), which deals with pre-segmented speech, StreamST faces the challenges of handling continuous and unbounded audio streams. This requires additional decisions about what to retain of the previous history, which is impractical to keep entirely due to latency and computational constraints. Despite the real-world demand for real-time ST, research on streaming translation remains limited, with existing works solely focusing on SimulST. To fill this gap, we introduce StreamAtt, the first StreamST policy, and propose StreamLAAL, the first StreamST latency metric designed to be comparable with existing metrics for SimulST. Extensive experiments across all 8 languages of MuST-C v1.0 show the effectiveness of StreamAtt compared to a naive streaming baseline and the related state-of-the-art SimulST policy, providing a first step in StreamST research.
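
To make the history-retention problem concrete, here is a minimal sketch of a streaming loop that keeps only a bounded amount of past audio. It is not StreamAtt: the abstract does not describe the selection rule, so this sketch swaps in plain fixed-length truncation as a stand-in. The names `stream_translate`, `translate`, and `max_history` are hypothetical and chosen for illustration only.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List


def stream_translate(
    chunks: Iterable[List[float]],
    translate: Callable[[List[float]], str],
    max_history: int,
) -> List[str]:
    """Feed audio chunks incrementally, retaining at most max_history samples.

    Assumption: fixed-length truncation stands in for the history selection
    step; the attention-based selection of the paper is NOT implemented here.
    """
    # A deque with maxlen silently drops the oldest samples once it is full.
    history: Deque[float] = deque(maxlen=max_history)
    outputs: List[str] = []
    for chunk in chunks:
        history.extend(chunk)  # append the newly received audio
        outputs.append(translate(list(history)))  # hypothetical model call
    return outputs
```

Calling it with any placeholder `translate` function shows that the retained audio stops growing once `max_history` samples have arrived, which is the constraint that separates StreamST from SimulST on pre-segmented speech.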

Code Implementations: 1 repo
