CL, AI · Nov 3, 2025

BanglaNirTox: A Large-scale Parallel Corpus for Explainable AI in Bengali Text Detoxification

Ayesha Afroza Mohsin, Mashrur Ahsan, Nafisa Maliyat, Shanta Maria, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan

arXiv:2511.01512v1 · 2.7 · h-index: 16

Originality: Incremental advance

AI Analysis

The paper addresses the lack of resources for Bengali text detoxification. It is an incremental advance for online safety in a low-resource language.

The paper proposes a pipeline for Bengali text detoxification and uses it to build BanglaNirTox, a large-scale parallel corpus of 68,041 toxic sentences paired with detoxified paraphrases. The authors report that the approach significantly improves detoxification quality and consistency.

Toxic language in Bengali remains prevalent, especially in online environments, with few effective precautions against it. Although text detoxification has seen progress in high-resource languages, Bengali remains underexplored due to limited resources. In this paper, we propose a novel pipeline for Bengali text detoxification that combines Pareto class-optimized large language models (LLMs) and Chain-of-Thought (CoT) prompting to generate detoxified sentences. To support this effort, we construct BanglaNirTox, an artificially generated parallel corpus of 68,041 toxic Bengali sentences with class-wise toxicity labels, reasonings, and detoxified paraphrases, using Pareto-optimized LLMs evaluated on random samples. The resulting BanglaNirTox dataset is used to fine-tune language models to produce better detoxified versions of Bengali sentences. Our findings show that Pareto-optimized LLMs with CoT prompting significantly enhance the quality and consistency of Bengali text detoxification.
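The abstract does not spell out how the Pareto-optimized models are chosen, so the following is only a minimal sketch of generic Pareto-front selection, not the authors' procedure. The model names, the two objectives (a detoxification score and a content-preservation score, both higher-is-better), and all numeric values are hypothetical placeholders. A class-wise variant would presumably repeat the same selection once per toxicity class.

```python
from typing import Dict, List, Tuple


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """Return True if a is at least as good as b on every objective
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(
            dominates(other, s)
            for other_name, other in scores.items()
            if other_name != name
        )
    ]


# Hypothetical candidates: (detoxification score, content preservation).
# These values are illustrative only and do not come from the paper.
candidates = {
    "model_a": (0.90, 0.60),
    "model_b": (0.80, 0.80),
    "model_c": (0.70, 0.70),  # dominated by model_b on both objectives
    "model_d": (0.60, 0.90),
}

front = pareto_front(candidates)
print(front)
```

In this toy setup, `model_c` is dropped because `model_b` beats it on both objectives. The remaining three models each represent a different trade-off, and none is strictly better than another.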
