CL · AS · May 5, 2022

Cross-modal Contrastive Learning for Speech Translation

ByteDance, CMU

arXiv:2205.02444v132.6658 citations · h-index: 60 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of bridging the speech and text modalities for speech translation. It shows strong performance gains, but the advance is incremental because it builds on existing contrastive learning approaches.

The paper tackles the problem of learning unified representations for speech and text to improve speech translation, proposing ConST, a cross-modal contrastive learning method that achieves an average BLEU score of 29.4 on the MuST-C benchmark and improves cross-modal retrieval accuracy from 4% to 88%.

How can we learn unified representations for spoken utterances and their written text? Learning similar representations for semantically similar speech and text is important for speech translation. To this end, we propose ConST, a cross-modal contrastive learning method for end-to-end speech-to-text translation. We evaluate ConST and a variety of previous baselines on the popular MuST-C benchmark. Experiments show that the proposed ConST consistently outperforms the previous methods and achieves an average BLEU of 29.4. The analysis further verifies that ConST indeed closes the representation gap between the modalities -- its learned representation improves the accuracy of cross-modal speech-text retrieval from 4% to 88%. Code and models are available at https://github.com/ReneeYe/ConST.
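To make the idea concrete, here is a minimal sketch of a cross-modal contrastive objective and of the speech-to-text retrieval accuracy mentioned above. It is an illustration under stated assumptions, not the ConST implementation: it assumes each utterance and each transcript has already been pooled into one fixed-size vector, uses cosine similarity with in-batch negatives, and picks the temperature value arbitrarily. The function names `contrastive_loss` and `retrieval_accuracy` are hypothetical; see the linked repository for the authors' code.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def contrastive_loss(speech, text, temperature=0.05):
    """InfoNCE-style loss with in-batch negatives.

    speech, text: arrays of shape (batch, dim); row i of each is assumed
    to be a paired utterance and transcript. The temperature is an
    assumed illustrative value, not one taken from the paper.
    """
    s = l2_normalize(speech)
    t = l2_normalize(text)
    logits = s @ t.T / temperature                       # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching transcript sits on the diagonal; every other row is a negative.
    return float(-np.mean(np.diag(log_probs)))


def retrieval_accuracy(speech, text):
    """Fraction of utterances whose nearest text vector is their own transcript."""
    s = l2_normalize(speech)
    t = l2_normalize(text)
    nearest = np.argmax(s @ t.T, axis=1)
    return float(np.mean(nearest == np.arange(len(s))))
```

Minimizing a loss of this shape pulls paired speech and text vectors together and pushes unpaired ones apart, which is the property that the retrieval accuracy then measures.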

