CL · AI · SD · AS · Dec 18, 2023

Soft Alignment of Modality Space for End-to-end Speech Translation

arXiv:2312.10952v1 · 10 citations · h-index: 14 · ICASSP
Originality: Incremental advance
AI Analysis

This paper addresses modality alignment for speech translation. It reports improved performance, but the advance is incremental over existing alignment methods.

The paper tackles modality differences between speech and text, which hinder cross-modal and cross-lingual transfer in end-to-end speech translation. It introduces Soft Alignment (S-Align), which uses adversarial training to align the representation spaces of the two modalities. On three languages from the MuST-C dataset, S-Align outperforms hard alignment methods and achieves translation capabilities comparable to specialized models.

End-to-end Speech Translation (ST) aims to convert speech into target text within a unified model. The inherent differences between speech and text modalities often impede effective cross-modal and cross-lingual transfer. Existing methods typically employ hard alignment (H-Align) of individual speech and text segments, which can degrade textual representations. To address this, we introduce Soft Alignment (S-Align), using adversarial training to align the representation spaces of both modalities. S-Align creates a modality-invariant space while preserving individual modality quality. Experiments on three languages from the MuST-C dataset show S-Align outperforms H-Align across multiple tasks and offers translation capabilities on par with specialized translation models.
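
To make the idea of adversarial alignment concrete, here is a minimal sketch of a generic modality-discriminator objective. It is not the paper's implementation: the abstract does not specify the discriminator, the loss, or how the encoder is updated, so the binary cross-entropy form, the minimax encoder loss, and the function names `discriminator_loss` and `encoder_adversarial_loss` are all assumptions chosen for illustration.

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def discriminator_loss(speech_logits, text_logits):
    # Assumed setup: a discriminator emits one logit per representation,
    # with speech labelled 1 and text labelled 0 (binary cross-entropy).
    # -log(sigmoid(z)) == softplus(-z); -log(1 - sigmoid(z)) == softplus(z).
    speech_term = sum(softplus(-z) for z in speech_logits) / len(speech_logits)
    text_term = sum(softplus(z) for z in text_logits) / len(text_logits)
    return speech_term + text_term


def encoder_adversarial_loss(speech_logits, text_logits):
    # Assumed minimax form: the encoder is trained to maximise the
    # discriminator's loss, i.e. to make the two modalities indistinguishable.
    return -discriminator_loss(speech_logits, text_logits)


# Discriminator cannot tell the modalities apart (all logits are 0).
confused = discriminator_loss([0.0, 0.0], [0.0, 0.0])

# Discriminator separates the modalities confidently.
separated = discriminator_loss([5.0, 5.0], [-5.0, -5.0])
```

In this toy setting `confused` equals 2·ln 2, the value reached when the discriminator outputs probability 0.5 for every input, and `separated` is much smaller. Under the assumed minimax form, the encoder is rewarded for pushing the discriminator toward the confused case, which is one common way to obtain a modality-invariant space without forcing individual speech and text segments to match.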

