CL | Aug 22, 2023

SONAR: Sentence-Level Multimodal and Language-Agnostic Representations

Paul-Ambroise Duquenne, Holger Schwenk, Benoît Sagot

arXiv:2308.11466v2 | 17.8 | 129 citations | h-index: 48 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for efficient, language-agnostic representations for multimodal applications, though it is incremental in building on existing embedding and translation methods.

The authors build SONAR, a unified multilingual and multimodal sentence embedding space. It outperforms existing sentence and speech encoders on similarity search tasks and achieves competitive text-to-text and speech-to-text translation results, including in zero-shot scenarios.

We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space. Our single text encoder, covering 200 languages, substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks. Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. Our encoders outperform existing speech encoders on similarity search tasks. We also provide a text decoder for 200 languages, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations. Our text-to-text results are competitive compared to the state-of-the-art NLLB 1B model, despite the fixed-size bottleneck representation. Our zero-shot speech-to-text translation results compare favorably with strong supervised baselines such as Whisper.
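
To make the similarity search evaluation mentioned in the abstract concrete, here is a minimal sketch of an xsim-style error rate over fixed-size sentence embeddings. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `xsim_error_rate` is hypothetical, plain cosine similarity stands in for whatever scoring the actual xsim tooling uses, and the toy embeddings are random vectors rather than SONAR outputs.

```python
import numpy as np


def xsim_error_rate(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Fraction of source sentences whose nearest target is not the aligned one.

    Assumes row i of src_emb and row i of tgt_emb embed translations of the
    same sentence (hypothetical setup for illustration).
    """
    # L2-normalize rows so that the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)

    # Pairwise cosine similarities, shape (n_src, n_tgt).
    sims = src @ tgt.T

    # Index of the most similar target for each source sentence.
    nearest = sims.argmax(axis=1)

    # An error is any source whose nearest target is not its aligned row.
    return float(np.mean(nearest != np.arange(len(src))))


# Toy data: "translations" are the target vectors plus small noise.
rng = np.random.default_rng(0)
tgt_emb = rng.normal(size=(5, 8))
src_emb = tgt_emb + 0.01 * rng.normal(size=(5, 8))
error = xsim_error_rate(src_emb, tgt_emb)
```

Because text and speech encoders map into the same space, the same function could in principle compare speech embeddings against text embeddings; a lower error rate means aligned sentences sit closer together.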
