Benchmarking Bengali Dialectal Bias: A Multi-Stage Framework Integrating RAG-Based Translation and Human-Augmented RLAIF

K. M. Jubair Sami, Dipto Sumit, Ariyan Hossain, Farig Sadeque

arXiv:2603.21359 | 55.0 | h-index: 1

AI Analysis

This work addresses a critical issue for users of low-resource language dialects by providing a rigorous benchmark and evaluation method, though it is incremental in building on existing bias assessment techniques.

The authors address performance bias in large language models against regional dialects of low-resource languages, focusing on Bengali. They develop a multi-stage framework that evaluates dialectal bias in question answering across nine dialects and find severe performance drops: responses score 5.44/10 for the Chittagong dialect versus 7.68/10 for Tangail.

Large language models (LLMs) frequently exhibit performance biases against regional dialects of low-resource languages. However, frameworks to quantify these disparities remain scarce. We propose a two-phase framework to evaluate dialectal bias in LLM question-answering across nine Bengali dialects. First, we translate and gold-label standard Bengali questions into dialectal variants using a retrieval-augmented generation (RAG) pipeline to prepare 4,000 question sets. Since traditional translation quality evaluation metrics fail on unstandardized dialects, we evaluate fidelity using an LLM-as-a-judge, which correlation with human judgments confirms outperforms legacy metrics. Second, we benchmark 19 LLMs across these gold-labeled sets, running 68,395 RLAIF evaluations validated through multi-judge agreement and human fallback. Our findings reveal severe performance drops linked to linguistic divergence. For instance, responses to the highly divergent Chittagong dialect score 5.44/10, compared to 7.68/10 for Tangail. Furthermore, increased model scale does not consistently mitigate this bias. We contribute a validated translation quality evaluation method, a rigorous benchmark dataset, and a Critical Bias Sensitivity (CBS) metric for safety-critical applications.
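
The abstract mentions evaluations "validated through multi-judge agreement and human fallback" without giving the procedure. The following is a minimal sketch of one way such a step could work, not the authors' implementation. The function names, the record layout, the `max_spread` disagreement threshold, and all scores in the usage example are hypothetical and chosen only for illustration.

```python
from statistics import mean


def aggregate_judges(scores, max_spread=2.0):
    """Combine 0-10 ratings from several LLM judges for one response.

    Returns (final_score, needs_human). The spread rule and the
    max_spread value are assumptions, not taken from the paper.
    """
    if len(scores) < 2:
        # A single judge cannot demonstrate agreement; escalate.
        return None, True
    spread = max(scores) - min(scores)
    if spread > max_spread:
        # Judges disagree too much; defer to a human rating.
        return None, True
    return mean(scores), False


def dialect_means(records, human_scores=None, max_spread=2.0):
    """Average the final scores per dialect.

    records: dicts with hypothetical keys "id", "dialect", "judge_scores".
    human_scores: optional {record id: human rating} used as the fallback.
    Records that need a human rating but have none are skipped.
    """
    human_scores = human_scores or {}
    by_dialect = {}
    for rec in records:
        score, needs_human = aggregate_judges(rec["judge_scores"], max_spread)
        if needs_human:
            score = human_scores.get(rec["id"])
            if score is None:
                continue  # unresolved: no agreement and no human rating
        by_dialect.setdefault(rec["dialect"], []).append(score)
    return {dialect: mean(vals) for dialect, vals in by_dialect.items()}
```

A usage example with made-up ratings (not data from the paper): the second record has widely split judge scores, so the human rating replaces them.

```python
records = [
    {"id": 1, "dialect": "Chittagong", "judge_scores": [5, 6, 5.5]},
    {"id": 2, "dialect": "Chittagong", "judge_scores": [2, 9, 5]},
    {"id": 3, "dialect": "Tangail", "judge_scores": [8, 7, 7.5]},
]
print(dialect_means(records, human_scores={2: 4.0}))
```

Using the score spread as the agreement test keeps the sketch simple; a rank-based or chance-corrected agreement statistic could be substituted without changing the fallback logic.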
