CL | Aug 22, 2023

SONAR: Sentence-Level Multimodal and Language-Agnostic Representations

arXiv:2308.11466v2 | 126 citations | h-index: 48
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient, language-agnostic representations for multimodal applications, though it is incremental in building on existing embedding and translation methods.

The authors build SONAR, a unified multilingual and multimodal sentence embedding space. It outperforms existing sentence and speech encoders on similarity search tasks and achieves competitive text-to-text and speech-to-text translation results, including in zero-shot settings.

We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space. Our single text encoder, covering 200 languages, substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks. Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. Our encoders outperform existing speech encoders on similarity search tasks. We also provide a text decoder for 200 languages, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations. Our text-to-text results are competitive compared to the state-of-the-art NLLB 1B model, despite the fixed-size bottleneck representation. Our zero-shot speech-to-text translation results compare favorably with strong supervised baselines such as Whisper.
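
To make the similarity search evaluation mentioned in the abstract concrete, here is a minimal sketch of how an error rate over aligned sentence pairs could be computed from fixed-size embeddings. It is not the paper's implementation: the function names are hypothetical, the embeddings are random toy vectors rather than SONAR outputs, and plain cosine nearest-neighbour search stands in for whatever scoring the xsim benchmark actually uses.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so that dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def similarity_search_error_rate(src_emb, tgt_emb):
    """Fraction of source rows whose nearest target row is not the aligned one.

    Assumes row i of src_emb and row i of tgt_emb embed the same sentence
    (e.g. in two languages, or as speech and text).
    """
    src = l2_normalize(src_emb)
    tgt = l2_normalize(tgt_emb)
    sims = src @ tgt.T                 # pairwise cosine similarities
    nearest = sims.argmax(axis=1)      # best-matching target for each source
    gold = np.arange(len(src))         # aligned index for each source
    return float((nearest != gold).mean())


# Toy data (assumption): "translations" are the same vectors plus small noise.
rng = np.random.default_rng(0)
tgt_emb = rng.normal(size=(5, 16))
src_emb = tgt_emb + 0.01 * rng.normal(size=(5, 16))
error = similarity_search_error_rate(src_emb, tgt_emb)
```

With real encoders, `src_emb` and `tgt_emb` would come from embedding the two sides of a parallel test set. A lower error rate means aligned sentences lie closer together in the shared space.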

Code Implementations: 1 repo
