Soft Alignment of Modality Space for End-to-end Speech Translation
This work addresses modality alignment for speech translation and offers improved performance, but it is incremental over existing alignment methods.
The paper tackles modality differences that hinder cross-modal and cross-lingual transfer in end-to-end speech translation. It introduces Soft Alignment (S-Align), which uses adversarial training, and reports that S-Align outperforms hard alignment methods on three languages from the MuST-C dataset and achieves translation capabilities comparable to specialized models.
End-to-end Speech Translation (ST) aims to convert speech into target text within a unified model. The inherent differences between speech and text modalities often impede effective cross-modal and cross-lingual transfer. Existing methods typically employ hard alignment (H-Align) of individual speech and text segments, which can degrade textual representations. To address this, we introduce Soft Alignment (S-Align), using adversarial training to align the representation spaces of both modalities. S-Align creates a modality-invariant space while preserving individual modality quality. Experiments on three languages from the MuST-C dataset show S-Align outperforms H-Align across multiple tasks and offers translation capabilities on par with specialized translation models.
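The abstract does not spell out the adversarial objective, so the following is only a minimal sketch of the general idea of aligning two representation spaces with a modality discriminator, not the paper's implementation. It assumes a linear discriminator, a minimax-style encoder loss, and toy 2-d vectors; all names (`discriminator_loss`, `encoder_adversarial_loss`, the toy data) are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def discriminator_prob(weights, bias, vec):
    """Probability (under a linear discriminator) that `vec` is a speech representation."""
    return sigmoid(sum(w * v for w, v in zip(weights, vec)) + bias)


def discriminator_loss(weights, bias, speech_vecs, text_vecs):
    """Binary cross-entropy with speech labelled 1 and text labelled 0."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for vec in speech_vecs:
        total -= math.log(discriminator_prob(weights, bias, vec) + eps)
    for vec in text_vecs:
        total -= math.log(1.0 - discriminator_prob(weights, bias, vec) + eps)
    return total / (len(speech_vecs) + len(text_vecs))


def encoder_adversarial_loss(weights, bias, speech_vecs, text_vecs):
    """Assumed minimax form: the encoders gain when the discriminator does badly."""
    return -discriminator_loss(weights, bias, speech_vecs, text_vecs)


# Toy data (hypothetical): two clearly separated spaces vs. two identical ones.
separated_speech = [[2.0, 0.0], [3.0, 0.5]]
separated_text = [[-2.0, 0.0], [-3.0, -0.5]]
overlapping_speech = [[0.1, 0.0], [-0.1, 0.2]]
overlapping_text = [[0.1, 0.0], [-0.1, 0.2]]

weights, bias = [1.0, 0.0], 0.0  # a fixed discriminator, for illustration only
loss_separated = discriminator_loss(weights, bias, separated_speech, separated_text)
loss_overlapping = discriminator_loss(weights, bias, overlapping_speech, overlapping_text)

print(f"separated spaces:   {loss_separated:.4f}")
print(f"overlapping spaces: {loss_overlapping:.4f}")
```

When the two spaces are separated, the discriminator's loss is low; when the speech and text representations coincide, it cannot do better than chance (a loss of at least ln 2). Pushing the encoders toward that regime aligns the spaces as a whole rather than forcing individual speech and text segments to match, which is the contrast with hard alignment that the abstract draws.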