CL · LG · Nov 22, 2024

BanglaEmbed: Efficient Sentence Embedding Models for a Low-Resource Language Using Cross-Lingual Distillation Techniques

arXiv:2411.15270v1 · 4 citations · h-index: 4 · ACAI
Originality: Incremental advance
AI Analysis

This work addresses the problem of limited NLP resources for Bengali speakers, offering efficient models for practical applications, though it is incremental as it adapts existing distillation techniques to a new language.

The paper tackles the lack of sentence embedding models for Bengali, a low-resource language, by introducing lightweight sentence transformers trained with cross-lingual knowledge distillation from English models. The resulting models outperformed existing Bangla sentence transformers on tasks such as paraphrase detection and semantic textual similarity.

Sentence-level embedding is essential for various tasks that require understanding natural language. Many studies have explored such embeddings for high-resource languages like English. However, low-resource languages like Bengali (a language spoken by almost two hundred and thirty million people) are still under-explored. This work introduces two lightweight sentence transformers for the Bangla language, leveraging a novel cross-lingual knowledge distillation approach. This method distills knowledge from a pre-trained, high-performing English sentence transformer. Proposed models are evaluated across multiple downstream tasks, including paraphrase detection, semantic textual similarity (STS), and Bangla hate speech detection. The new method consistently outperformed existing Bangla sentence transformers. Moreover, the lightweight architecture and shorter inference time make the models highly suitable for deployment in resource-constrained environments, making them valuable for practical NLP applications in low-resource languages.
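The abstract does not spell out the training objective, so the following is only a minimal sketch of one common way cross-lingual distillation is formulated: a frozen English teacher embeds the English sentence, and the student is trained so that its embeddings of both the English sentence and its Bangla translation land close to the teacher's vector under a mean-squared-error loss. The function names (`mse`, `distillation_loss`) and the choice of MSE are assumptions for illustration, not the authors' confirmed method.

```python
from typing import Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    teacher_en: Sequence[Vector],
    student_en: Sequence[Vector],
    student_bn: Sequence[Vector],
) -> float:
    """Batch-averaged distillation loss (assumed formulation).

    teacher_en: teacher embeddings of the English sentences (kept fixed).
    student_en: student embeddings of the same English sentences.
    student_bn: student embeddings of the Bangla translations.
    """
    if not (len(teacher_en) == len(student_en) == len(student_bn)):
        raise ValueError("all batches must have the same size")
    total = 0.0
    for t, s_en, s_bn in zip(teacher_en, student_en, student_bn):
        # Pull both student views toward the teacher's English embedding.
        total += mse(t, s_en) + mse(t, s_bn)
    return total / len(teacher_en)
```

In a real training loop the student embeddings would come from a trainable model and the loss would be minimized by gradient descent; here the vectors are plain lists so the objective itself can be inspected in isolation.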

