LEALLA: Learning Lightweight Language-agnostic Sentence Embeddings with Knowledge Distillation
This work addresses the need for faster, more efficient sentence embeddings in multilingual applications. Its contribution is incremental, since it builds on existing large-scale models such as LaBSE.
The paper tackles the slow inference and high computational overhead of large-scale language-agnostic sentence embedding models by proposing LEALLA, a lightweight model that combines a thin-deep encoder with knowledge distillation from a teacher model. It reports effective results on the Tatoeba, United Nations, and BUCC benchmarks while covering 109 languages.
Large-scale language-agnostic sentence embedding models such as LaBSE (Feng et al., 2022) obtain state-of-the-art performance for parallel sentence alignment. However, these large-scale models can suffer from slow inference and high computational overhead. This study systematically explores learning language-agnostic sentence embeddings with lightweight models. We demonstrate that a thin-deep encoder can construct robust low-dimensional sentence embeddings for 109 languages. With our proposed distillation methods, we achieve further improvements by incorporating knowledge from a teacher model. Empirical results on Tatoeba, United Nations, and BUCC show the effectiveness of our lightweight models. We release our lightweight language-agnostic sentence embedding models LEALLA on TensorFlow Hub.
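To make the idea of distilling a teacher's sentence embeddings into a lower-dimensional student more concrete, here is a minimal sketch. The abstract does not specify the distillation objectives, so everything below is an assumption for illustration: the loss is a generic embedding-matching objective (squared distance between the normalized student embedding and a linearly projected, normalized teacher embedding), the function names `l2_normalize`, `project`, and `distillation_loss` are hypothetical, and the toy dimensions (8 for the teacher, 3 for the student) are placeholders rather than the sizes used by LaBSE or LEALLA.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if its norm is zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def project(vec, matrix):
    """Map a teacher-sized vector to student size; matrix is (in_dim x out_dim)."""
    out_dim = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(out_dim)]


def distillation_loss(student_batch, teacher_batch, projection):
    """Mean squared distance between normalized student embeddings and
    normalized, projected teacher embeddings (illustrative objective only)."""
    total = 0.0
    for student_vec, teacher_vec in zip(student_batch, teacher_batch):
        target = l2_normalize(project(teacher_vec, projection))
        pred = l2_normalize(student_vec)
        total += sum((p - q) ** 2 for p, q in zip(pred, target))
    return total / len(student_batch)


# Toy data: assumed dimensions, random values standing in for real embeddings.
rng = random.Random(0)
TEACHER_DIM, STUDENT_DIM, BATCH = 8, 3, 4

teacher_batch = [[rng.gauss(0, 1) for _ in range(TEACHER_DIM)] for _ in range(BATCH)]
projection = [[rng.gauss(0, 1) for _ in range(STUDENT_DIM)] for _ in range(TEACHER_DIM)]

# An untrained student produces unrelated embeddings, so the loss is positive.
random_student = [[rng.gauss(0, 1) for _ in range(STUDENT_DIM)] for _ in range(BATCH)]
loss_random = distillation_loss(random_student, teacher_batch, projection)

# A student that reproduces the projected teacher embeddings drives the loss to ~0.
perfect_student = [project(t, projection) for t in teacher_batch]
loss_perfect = distillation_loss(perfect_student, teacher_batch, projection)

print(f"random student loss:  {loss_random:.4f}")
print(f"perfect student loss: {loss_perfect:.2e}")
```

In a real training setup the student would be a neural encoder updated by gradient descent on such a loss, typically alongside a translation-ranking objective on parallel sentences. The sketch only shows how the dimension mismatch between teacher and student can be bridged by a projection before the embeddings are compared.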