CL AI SD AS Dec 18, 2023

Soft Alignment of Modality Space for End-to-end Speech Translation

Yuhao Zhang, Kaiqi Kou, Bei Li, Chen Xu, Chunliang Zhang, Tong Xiao, Jingbo Zhu

arXiv:2312.10952v12.110 citations h-index: 14 ICASSP

Originality: Incremental advance

AI Analysis

This work addresses modality alignment for speech translation. It offers improved performance but is incremental over existing alignment methods.

The paper tackles modality differences that hinder cross-modal and cross-lingual transfer in end-to-end speech translation. It introduces Soft Alignment (S-Align), which uses adversarial training to align the speech and text representation spaces. The authors report that S-Align outperforms hard alignment methods on three languages from the MuST-C dataset and achieves translation capabilities comparable to specialized models.

End-to-end Speech Translation (ST) aims to convert speech into target text within a unified model. The inherent differences between speech and text modalities often impede effective cross-modal and cross-lingual transfer. Existing methods typically employ hard alignment (H-Align) of individual speech and text segments, which can degrade textual representations. To address this, we introduce Soft Alignment (S-Align), using adversarial training to align the representation spaces of both modalities. S-Align creates a modality-invariant space while preserving individual modality quality. Experiments on three languages from the MuST-C dataset show S-Align outperforms H-Align across multiple tasks and offers translation capabilities on par with specialized translation models.
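The abstract does not spell out the adversarial objective, so the following is only a minimal sketch of one common way to set up adversarial alignment of two representation spaces, not the authors' formulation. It assumes a hypothetical modality discriminator that outputs, for each representation, the probability that it came from speech. The discriminator minimizes a binary cross-entropy loss, while the encoder is trained on the negated loss so that the two modalities become hard to tell apart. All function names here are illustrative.

```python
import math


def bce(p, label):
    """Binary cross-entropy for one predicted probability p of 'speech'."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def discriminator_loss(p_speech, p_text):
    """Assumed discriminator objective: speech is labeled 1, text is labeled 0.

    p_speech / p_text are the discriminator's predicted 'speech' probabilities
    for speech-derived and text-derived representations respectively.
    """
    losses = [bce(p, 1) for p in p_speech] + [bce(p, 0) for p in p_text]
    return sum(losses) / len(losses)


def encoder_adversarial_loss(p_speech, p_text):
    """Assumed encoder objective: the negated discriminator loss (minimax).

    Minimizing this pushes the encoder toward representations that the
    discriminator cannot separate by modality.
    """
    return -discriminator_loss(p_speech, p_text)


if __name__ == "__main__":
    # Well-separated modalities: the discriminator is confident and correct.
    separated = discriminator_loss([0.9, 0.8], [0.1, 0.2])
    # Indistinguishable modalities: the discriminator can only guess 0.5.
    aligned = discriminator_loss([0.5, 0.5], [0.5, 0.5])
    print(f"separated: {separated:.4f}, aligned: {aligned:.4f}")
```

In this setup, a discriminator that can only guess 0.5 for every input has a loss of ln 2, which is the point where the representation space no longer reveals the modality. Because the alignment signal acts on the distribution of representations rather than on paired segments, it corresponds to the "soft" alignment the abstract contrasts with H-Align.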
