DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs

arXiv:2605.3143274.3

AI Analysis

This work addresses long-form simultaneous translation with SpeechLLMs, offering a training-free policy that works with existing, off-the-shelf models.

The paper introduces Decoder-Only Attention (DOA), a training-free policy that enables long-form simultaneous translation using off-the-shelf SpeechLLMs. DOA derives a proxy alignment from self-attention, allowing for low-latency long-form SimulST with translation quality close to offline decoding, without requiring retraining.

Simultaneous speech-to-text translation (SimulST) generates translations while speech is still unfolding, requiring a streaming policy that decides when to read and when to write. State-of-the-art approaches rely on attention-based encoder-decoder models where cross-attention provides explicit alignment signals. In contrast, Speech Large Language Models (SpeechLLMs) are decoder-only architectures relying solely on self-attention. This raises a central question: whether decoder self-attention contains sufficiently stable alignment signals to guide the streaming policy. Moreover, existing approaches typically rely on training-based adaptations or heuristic wait-$k$ policies and have not been validated in long-form settings. To fill these gaps, we propose Decoder-Only Attention (DOA), a training-free policy that enables long-form simultaneous translation with off-the-shelf SpeechLLMs by deriving a proxy alignment from self-attention. Experiments on Phi4-Multimodal and Qwen3-Omni show that DOA provides an effective alignment signal for supporting streaming decisions, enabling low-latency long-form SimulST with quality close to offline decoding without retraining.
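To make the read/write decision concrete, the sketch below contrasts a heuristic wait-k rule with a generic attention-based rule. It is an illustration only and not the DOA policy itself, whose details the abstract does not give. The function names, the `recent_frames` and `threshold` parameters, and the rule "write only when little attention mass falls on the most recent audio frames" are all assumptions made for this example.

```python
from typing import Sequence


def wait_k_action(frames_read: int, tokens_written: int, k: int) -> str:
    """Heuristic wait-k rule: stay k audio units ahead of the output.

    Returns "WRITE" when at least k more units have been read than
    tokens written, otherwise "READ".
    """
    return "WRITE" if frames_read - tokens_written >= k else "READ"


def should_write(
    attn_over_audio: Sequence[float],
    recent_frames: int = 2,    # hypothetical parameter: size of the "recent" window
    threshold: float = 0.3,    # hypothetical parameter: max attention mass allowed there
) -> bool:
    """Generic attention-based rule (an assumption, not the paper's method).

    attn_over_audio holds the attention weights that a candidate output
    token places on the audio positions received so far. If the token
    leans on the most recent frames, the needed audio may still be
    arriving, so the policy waits (READ). Otherwise it emits (WRITE).
    """
    total = sum(attn_over_audio)
    if total <= 0.0:
        return False  # no usable signal: wait for more audio
    n = max(0, min(recent_frames, len(attn_over_audio)))
    recent_mass = sum(attn_over_audio[len(attn_over_audio) - n:]) / total
    return recent_mass < threshold


if __name__ == "__main__":
    # Attention concentrated on early audio: the token can be emitted.
    print(should_write([0.5, 0.3, 0.15, 0.05]))
    # Attention concentrated on the newest audio: wait for more input.
    print(should_write([0.1, 0.1, 0.3, 0.5]))
    # wait-k with k=3: only write once three units have been read.
    print(wait_k_action(frames_read=2, tokens_written=0, k=3))
```

In an encoder-decoder model the weights would come from cross-attention. In a decoder-only SpeechLLM they would have to be taken from the self-attention entries that link a text position to the audio positions, which is the kind of proxy alignment the abstract refers to.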
