Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi
This work addresses code-mixed NLP challenges for Bangla, English, and Hindi, offering efficient alternatives for multilingual understanding, though it is incremental in that it builds on existing BERT-based methods.
The paper tackles text classification on code-mixed Bangla, English, and Hindi text by introducing Tri-Distil-BERT and Mixed-Distil-BERT, which demonstrate competitive performance against larger models such as mBERT and XLM-R across multiple NLP tasks.
Text classification is one of the most popular downstream tasks in Natural Language Processing, and it becomes more daunting when the text is code-mixed. Although they are not exposed to such text during pre-training, various BERT models have demonstrated success in tackling code-mixed NLP challenges. In addition, to enhance their performance, code-mixed NLP models have relied on combining synthetic data with real-world data. It is therefore crucial to understand how the performance of BERT models is affected when they are pre-trained on the corresponding code-mixed languages. In this paper, we introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model fine-tuned on code-mixed data. Both models are evaluated across multiple NLP tasks and demonstrate competitive performance against larger models such as mBERT and XLM-R. Our two-tiered pre-training approach offers efficient alternatives for multilingual and code-mixed language understanding.