CL LG

Nov 22, 2024

BanglaEmbed: Efficient Sentence Embedding Models for a Low-Resource Language Using Cross-Lingual Distillation Techniques

Muhammad Rafsan Kabir, Md. Mohibur Rahman Nabil, Mohammad Ashrafuzzaman Khan

arXiv:2411.15270v1 | 1.94 citations | h-index: 4 | ACAI

Originality: Incremental advance

AI Analysis

This work addresses the scarcity of NLP resources for Bengali by offering efficient models for practical applications. The advance is incremental, since it adapts existing distillation techniques to a new language.

The paper tackles the lack of sentence embedding models for Bengali, a low-resource language, by introducing lightweight sentence transformers trained with cross-lingual knowledge distillation from English models. The resulting models outperform existing Bangla sentence transformers on tasks such as paraphrase detection and semantic textual similarity.

Sentence-level embeddings are essential for many tasks that require natural language understanding. Many studies have explored such embeddings for high-resource languages like English. However, low-resource languages like Bengali (spoken by almost two hundred and thirty million people) remain under-explored. This work introduces two lightweight sentence transformers for the Bangla language, leveraging a novel cross-lingual knowledge distillation approach that distills knowledge from a pre-trained, high-performing English sentence transformer. The proposed models are evaluated on multiple downstream tasks, including paraphrase detection, semantic textual similarity (STS), and Bangla hate speech detection. The new method consistently outperforms existing Bangla sentence transformers. Moreover, the lightweight architecture and shorter inference time make the models well suited to deployment in resource-constrained environments, and therefore valuable for practical NLP applications in low-resource languages.
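
The abstract does not spell out the distillation objective, so the following is only a toy sketch of the general idea behind cross-lingual distillation: a student that reads Bangla input is trained to reproduce the embedding a frozen English teacher assigns to the parallel English sentence. Everything here is an assumption made for illustration, not the authors' method: the mean-squared-error loss, the linear student, the function names (`mse`, `student_embed`, `distill_step`), and the hand-written feature and teacher vectors that stand in for real sentences and real transformer outputs.

```python
# Toy sketch of cross-lingual distillation (assumed MSE objective, linear student).
# Real systems use transformer encoders; all vectors below are made up.

def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def student_embed(W, x):
    """Hypothetical student: a single linear map from input features to an embedding."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def distill_step(W, bangla_features, teacher_embedding, lr=0.1):
    """One gradient step pulling the student's output toward the teacher's embedding.

    Returns the loss measured before the update.
    """
    pred = student_embed(W, bangla_features)
    n = len(pred)
    for i in range(n):
        # d(MSE)/d(pred_i) = 2 * (pred_i - target_i) / n
        grad_coeff = 2.0 * (pred[i] - teacher_embedding[i]) / n
        for j, xj in enumerate(bangla_features):
            W[i][j] -= lr * grad_coeff * xj
    return mse(pred, teacher_embedding)

# Each pair: (toy features of a Bangla sentence,
#             frozen teacher embedding of its English translation).
parallel_pairs = [
    ([1.0, 0.0, 0.5], [0.2, -0.4]),
    ([0.0, 1.0, 0.5], [-0.3, 0.6]),
]

# Student weights: 2-dimensional output, 3-dimensional input, initialised to zero.
W = [[0.0, 0.0, 0.0] for _ in range(2)]

losses = []  # average pre-update loss per epoch
for epoch in range(200):
    epoch_loss = 0.0
    for features, teacher_vec in parallel_pairs:
        epoch_loss += distill_step(W, features, teacher_vec)
    losses.append(epoch_loss / len(parallel_pairs))

print("first epoch loss:", losses[0])
print("last epoch loss:", losses[-1])
```

Because the teacher is never updated, only the small student needs to be trained and deployed, which is consistent with the abstract's emphasis on lightweight models and short inference time.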
