CLApr 3, 2025
State-of-the-Art Translation of Text-to-Gloss using mBART : A case study of BanglaSharif Md. Abdullah, Abhijit Paul, Shebuti Rayana et al.
Despite a large deaf and dumb population of 1.7 million, Bangla Sign Language (BdSL) remains a understudied domain. Specifically, there are no works on Bangla text-to-gloss translation task. To address this gap, we begin by addressing the dataset problem. We take inspiration from grammatical rule based gloss generation used in Germany and American sign langauage (ASL) and adapt it for BdSL. We also leverage LLM to generate synthetic data and use back-translation, text generation for data augmentation. With dataset prepared, we started experimentation. We fine-tuned pretrained mBART-50 and mBERT-multiclass-uncased model on our dataset. We also trained GRU, RNN and a novel seq-to-seq model with multi-head attention. We observe significant high performance (ScareBLEU=79.53) with fine-tuning pretrained mBART-50 multilingual model from Facebook. We then explored why we observe such high performance with mBART. We soon notice an interesting property of mBART -- it was trained on shuffled and masked text data. And as we know, gloss form has shuffling property. So we hypothesize that mBART is inherently good at text-to-gloss tasks. To find support against this hypothesis, we trained mBART-50 on PHOENIX-14T benchmark and evaluated it with existing literature. Our mBART-50 finetune demonstrated State-of-the-Art performance on PHOENIX-14T benchmark, far outperforming existing models in all 6 metrics (ScareBLEU = 63.89, BLEU-1 = 55.14, BLEU-2 = 38.07, BLEU-3 = 27.13, BLEU-4 = 20.68, COMET = 0.624). Based on the results, this study proposes a new paradigm for text-to-gloss task using mBART models. Additionally, our results show that BdSL text-to-gloss task can greatly benefit from rule-based synthetic dataset.
CLOct 10, 2025
Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma LanguageAdity Khisa, Nusrat Jahan Lia, Tasnim Mahfuz Nafis et al.
As an Indo-Aryan language with limited available data, Chakma remains largely underrepresented in language models. In this work, we introduce a novel corpus of contextually coherent Bangla-transliterated Chakma, curated from Chakma literature, and validated by native speakers. Using this dataset, we fine-tune six encoder-based multilingual and regional transformer models (mBERT, XLM-RoBERTa, DistilBERT, DeBERTaV3, BanglaBERT, and IndicBERT) on masked language modeling (MLM) tasks. Our experiments show that fine-tuned multilingual models outperform their pre-trained counterparts when adapted to Bangla-transliterated Chakma, achieving up to 73.54% token accuracy and a perplexity as low as 2.90. Our analysis further highlights the impact of data quality on model performance and shows the limitations of OCR pipelines for morphologically rich Indic scripts. Our research demonstrates that Bangla-transliterated Chakma can be very effective for transfer learning for Chakma language, and we release our manually validated monolingual dataset to encourage further research on multilingual language modeling for low-resource languages.
CLAug 21, 2025
Stemming -- The Evolution and Current State with a Focus on BanglaAbhijit Paul, Mashiat Amin Farin, Sharif Md. Abdullah et al.
Bangla, the seventh most widely spoken language worldwide with 300 million native speakers, faces digital under-representation due to limited resources and lack of annotated datasets. Stemming, a critical preprocessing step in language analysis, is essential for low-resource, highly-inflectional languages like Bangla, because it can reduce the complexity of algorithms and models by significantly reducing the number of words the algorithm needs to consider. This paper conducts a comprehensive survey of stemming approaches, emphasizing the importance of handling morphological variants effectively. While exploring the landscape of Bangla stemming, it becomes evident that there is a significant gap in the existing literature. The paper highlights the discontinuity from previous research and the scarcity of accessible implementations for replication. Furthermore, it critiques the evaluation methodologies, stressing the need for more relevant metrics. In the context of Bangla's rich morphology and diverse dialects, the paper acknowledges the challenges it poses. To address these challenges, the paper suggests directions for Bangla stemmer development. It concludes by advocating for robust Bangla stemmers and continued research in the field to enhance language analysis and processing.
SEDec 31, 2017
SAFFRON: A Semi-Automated Framework for Software Requirements PrioritizationSyed Ali Asif, Zarif Masud, Rubaida Easmin et al.
Due to dynamic nature of current software development methods, changes in requirements are embraced and given proper consideration. However, this triggers the rank reversal problem which involves re-prioritizing requirements based on stakeholders' feedback. It incurs significant cost because of time elapsed in large number of human interactions. To solve this issue, a Semi-Automated Framework for soFtware Requirements priOritizatioN (SAFFRON) is presented in this paper. For a particular requirement, SAFFRON predicts appropriate stakeholders' ratings to reduce human interactions. Initially, item-item collaborative filtering is utilized to estimate similarity between new and previously elicited requirements. Using this similarity, stakeholders who are most likely to rate requirements are determined. Afterwards, collaborative filtering based on latent factor model is used to predict ratings of those stakeholders. The proposed approach is implemented and tested on RALIC dataset. The results illustrate consistent correlation, similar to state of the art approaches, with the ground truth. In addition, SAFFRON requires 13.5-27% less human interaction for re-prioritizing requirements.