CL · Feb 16, 2023

LEALLA: Learning Lightweight Language-agnostic Sentence Embeddings with Knowledge Distillation

arXiv:2302.08387v2273 citations · h-index: 14
AI Analysis

This work addresses the need for faster and more efficient sentence embeddings for multilingual applications, though it is incremental as it builds on existing large-scale models like LaBSE.

The paper addresses the slow inference and high computational overhead of large-scale language-agnostic sentence embedding models. It proposes LEALLA, a lightweight model that combines a thin-deep encoder with knowledge distillation, and reports effective performance on the Tatoeba, United Nations, and BUCC benchmarks across 109 languages.

Large-scale language-agnostic sentence embedding models such as LaBSE (Feng et al., 2022) obtain state-of-the-art performance for parallel sentence alignment. However, these large-scale models can suffer from inference speed and computation overhead. This study systematically explores learning language-agnostic sentence embeddings with lightweight models. We demonstrate that a thin-deep encoder can construct robust low-dimensional sentence embeddings for 109 languages. With our proposed distillation methods, we achieve further improvements by incorporating knowledge from a teacher model. Empirical results on Tatoeba, United Nations, and BUCC show the effectiveness of our lightweight models. We release our lightweight language-agnostic sentence embedding models LEALLA on TensorFlow Hub.
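To make the idea of distilling a large teacher into a low-dimensional student more concrete, here is a minimal sketch of one common form of embedding-level distillation: the student embedding is mapped up to the teacher's dimension and pulled toward the teacher embedding with a squared-error penalty. This is an illustration under stated assumptions, not the paper's actual objective, which the abstract does not spell out. The function names, the fixed projection matrix, and the toy vectors are all hypothetical; in practice the projection would be learned jointly with the student.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def project(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def feature_distillation_loss(student_embs, teacher_embs, projection):
    """Mean squared distance between projected student and teacher embeddings.

    Assumption: both sides are L2-normalized before comparison, so the loss
    depends only on direction, which is what cosine-based alignment uses.
    """
    total = 0.0
    for student, teacher in zip(student_embs, teacher_embs):
        student_up = l2_normalize(project(projection, student))
        teacher_unit = l2_normalize(teacher)
        total += sum((a - b) ** 2 for a, b in zip(student_up, teacher_unit))
    return total / len(student_embs)


# Toy example: 2-dimensional student, 3-dimensional teacher (hypothetical values).
projection = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
student_embs = [[1.0, 0.0], [0.0, 1.0]]
teacher_embs = [[2.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

loss = feature_distillation_loss(student_embs, teacher_embs, projection)
print(loss)
```

In the toy data the first pair points in the same direction after projection and contributes nothing, while the second pair is orthogonal and contributes a squared distance of 2, so the loss shrinks as the student's directions come to match the teacher's.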
