DTW-Align: Bridging the Modality Gap in End-to-End Speech Translation with Dynamic Time Warping Alignment
This work addresses modality alignment in speech translation for researchers and practitioners. It offers a faster alignment method that produces more accurate alignments and needs no language-specific alignment tools, though it is incremental over prior alignment techniques.
The paper tackles the modality gap in End-to-End Speech Translation by adapting Dynamic Time Warping to align speech and text embeddings during training. Compared to previous work, the method yields more accurate alignments and comparable translation quality while being significantly faster, and it outperforms previous work in low-resource settings on 5 out of 6 language directions.
End-to-End Speech Translation (E2E-ST) is the task of translating source speech directly into target text, bypassing the intermediate transcription step. The representation discrepancy between the speech and text modalities has motivated research on what is known as bridging the modality gap. State-of-the-art methods address this by aligning speech and text representations at the word or token level. Unfortunately, this requires an alignment tool that is not available for all languages. Although this issue has been addressed by aligning speech and text embeddings using nearest-neighbor similarity search, that approach does not lead to accurate alignments. In this work, we adapt Dynamic Time Warping (DTW) for aligning speech and text embeddings during training. Our experiments demonstrate the effectiveness of our method in bridging the modality gap in E2E-ST. Compared to previous work, our method produces more accurate alignments and achieves comparable E2E-ST results while being significantly faster. Furthermore, our method outperforms previous work in low-resource settings on 5 out of 6 language directions.
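To make the idea of DTW-based alignment concrete, the following is a minimal sketch of textbook Dynamic Time Warping applied to a sequence of speech-frame embeddings and a sequence of text-token embeddings. It is not the paper's implementation: the cosine-distance cost, the step pattern (diagonal, vertical, horizontal), and the function names `cosine_distance_matrix` and `dtw_align` are assumptions chosen for illustration, and the abstract does not specify how the authors adapt DTW for use during training.

```python
import numpy as np


def cosine_distance_matrix(speech, text):
    """Pairwise cosine distance between speech frames (n, d) and text tokens (m, d).

    Assumption: cosine distance is used as the local cost; the abstract does
    not state which cost the paper uses.
    """
    s = speech / np.linalg.norm(speech, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    return 1.0 - s @ t.T


def dtw_align(cost):
    """Classic DTW over an (n, m) cost matrix.

    Returns a monotonic path of (speech_index, text_index) pairs from (0, 0)
    to (n - 1, m - 1), plus the accumulated cost of that path.
    """
    n, m = cost.shape
    # acc[i, j] holds the cheapest cost of aligning the first i frames
    # with the first j tokens; row 0 and column 0 are padding.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev

    # Backtrack from the end, always stepping to the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1, j - 1], i - 1, j - 1),  # diagonal step
            (acc[i - 1, j], i - 1, j),          # advance speech only
            (acc[i, j - 1], i, j - 1),          # advance text only
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return path, float(acc[n, m])


# Toy example: four speech frames, two text tokens (hypothetical values).
speech_emb = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
text_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
alignment, total_cost = dtw_align(cosine_distance_matrix(speech_emb, text_emb))
```

Unlike a per-frame nearest-neighbor search, the path returned here is constrained to be monotonic and to cover every frame and every token, which is the property that makes DTW a natural fit for aligning speech with its transcript. The double loop costs O(n·m) per utterance; any speedups reported in the paper would come from its own adaptation, which this sketch does not attempt to reproduce.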