CL, Dec 15, 2025

Advancing Bangla Machine Translation Through Informal Datasets

Ayon Roy, Risat Rahaman, Sadat Shibly, Udoy Saha Joy, Abdulla Al Kafi, Farig Yousuf Sadeque

arXiv:2512.13487v12.7, h-index: 10, Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the exclusion of millions of Bangla speakers from online information by improving translation of informal language, though it appears incremental because it builds on existing models and datasets.

The paper tackles the limited progress in Bangla machine translation by focusing on informal language, which is widely used but neglected in existing research. It proposes improvements by developing a dataset from informal sources such as social media and conversational texts.

Bangla is the sixth most widely spoken language globally, with approximately 234 million native speakers. However, progress in open-source Bangla machine translation remains limited. Most online resources are in English and often remain untranslated into Bangla, excluding millions from accessing essential information. Existing research in Bangla translation primarily focuses on formal language, neglecting the more commonly used informal language. This is largely due to the lack of pairwise Bangla-English data and advanced translation models. If datasets and models can be enhanced to better handle natural, informal Bangla, millions of people will benefit from improved online information access. In this research, we explore current state-of-the-art models and propose improvements to Bangla translation by developing a dataset from informal sources like social media and conversational texts. This work aims to advance Bangla machine translation by focusing on informal language translation and improving accessibility for Bangla speakers in the digital world.

