CL Dec 11, 2024

BDA: Bangla Text Data Augmentation Framework

Md. Tariquzzaman, Audwit Nafi Anam, Naimul Haque, Mohsinul Kabir, Hasan Mahmud, Md Kamrul Hasan

arXiv:2412.08753v2, 1.93 citations, h-index: 16, Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses data scarcity in Bangla NLP tasks, but its contribution is incremental: it applies existing augmentation methods to a specific domain.

The paper tackles data scarcity in Bangla text classification by introducing the BDA framework for data augmentation. BDA achieved significant F1 score improvements across five datasets, and models trained on 50% of the training data with augmentation matched the performance of models trained on 100% of the data.

Data augmentation involves generating synthetic samples that resemble those in a given dataset. In resource-limited fields where high-quality data is scarce, augmentation plays a crucial role in increasing the volume of training data. This paper introduces a Bangla Text Data Augmentation (BDA) Framework that uses both pre-trained models and rule-based methods to create new variants of the text. A filtering process is included to ensure that the new text keeps the same meaning as the original while also adding variety in the words used. We conduct a comprehensive evaluation of the framework's effectiveness in Bangla text classification tasks. Our framework achieved significant improvement in F1 scores across five distinct datasets, delivering performance equivalent to models trained on 100% of the data while utilizing only 50% of the training dataset. Additionally, we explore the impact of data scarcity by progressively reducing the training data and augmenting it through BDA, resulting in notable F1 score enhancements. The study offers a thorough examination of BDA's performance, identifying key factors for optimal results and addressing its limitations through detailed analysis.
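The generate-then-filter idea in the abstract can be illustrated with a small sketch. This is not the BDA implementation: the synonym table, the function names, the thresholds, and the English toy sentence are all hypothetical, and character-trigram overlap is only a crude stand-in for whatever semantic-similarity check the framework actually uses. The sketch shows the two filter conditions working together: a candidate is kept only if it stays close to the original (the meaning proxy) and also differs from it in vocabulary (the variety condition).

```python
def jaccard(a, b):
    """Jaccard overlap between two collections, treated as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def char_ngrams(text, n=3):
    """Character n-grams of a string (at least one slice, even for short text)."""
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]


def keep(original, candidate, min_sim=0.5, min_novelty=0.2):
    """Filter step (assumed thresholds, not values from the paper).

    sim     : character-trigram overlap, a rough proxy for "same meaning".
    novelty : 1 - token-level overlap, a rough proxy for lexical variety.
    """
    sim = jaccard(char_ngrams(original), char_ngrams(candidate))
    novelty = 1.0 - jaccard(original.split(), candidate.split())
    return sim >= min_sim and novelty >= min_novelty


def candidates(text, synonyms):
    """Rule-based generation: replace one token at a time using a synonym table."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        for alt in synonyms.get(tok, []):
            out.append(" ".join(tokens[:i] + [alt] + tokens[i + 1:]))
    return out


def augment(text, synonyms, min_sim=0.5, min_novelty=0.2):
    """Generate candidates, then keep only those that pass the filter."""
    return [c for c in candidates(text, synonyms)
            if keep(text, c, min_sim, min_novelty)]


# Toy, hypothetical synonym table; a real setup would use Bangla resources.
toy_synonyms = {"movie": ["film"], "good": ["great", "good"]}
result = augment("the movie was good", toy_synonyms)
print(result)
```

With these assumed thresholds, the unchanged copy is dropped for adding no variety, and a candidate that drifts too far from the original at the character level is dropped by the similarity check. In practice the similarity proxy would be replaced by a sentence-embedding model or another semantic measure.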
