Dialectal Toxicity Detection: Evaluating LLM-as-a-Judge Consistency Across Language Varieties
This paper addresses bias and inconsistency in LLM-based toxicity detection for speakers of diverse language varieties. The contribution is incremental, since it builds on existing LLM-as-a-judge methods.
The paper examines how dialectal differences affect toxicity detection by large language models (LLMs). Based on evaluations across 10 language clusters and 60 varieties, it finds that LLMs are sensitive to multilingual and dialectal variation, and that LLM-human agreement is the weakest form of consistency, followed by dialectal consistency.
There has been little systematic study of how dialectal differences affect toxicity detection by modern LLMs. Furthermore, although using LLMs as evaluators ("LLM-as-a-judge") is a growing research area, their sensitivity to dialectal nuances remains underexplored. In this paper, we address these gaps through a comprehensive toxicity evaluation of LLMs across diverse dialects. We create a multi-dialect dataset through synthetic transformations and human-assisted translations, covering 10 language clusters and 60 varieties. We then evaluate three LLMs on their ability to assess toxicity along three dimensions: multilingual consistency, dialectal consistency, and LLM-human consistency. Our findings show that LLMs are sensitive to both multilingual and dialectal variations. Ranked by consistency, the weakest area is LLM-human agreement, followed by dialectal consistency. Code repository: \url{https://github.com/ffaisal93/dialect_toxicity_llm_judge}
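To make the consistency dimensions concrete, the following is a minimal sketch of how dialectal consistency and LLM-human agreement could be scored. It is an illustration under stated assumptions, not the paper's actual metric: the 1-5 rating scale, the exact-match agreement measure, the function names, and all scores are hypothetical.

```python
from statistics import mean


def pairwise_agreement(a, b, tolerance=0):
    """Fraction of items whose two ratings differ by at most `tolerance`."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(x - y) <= tolerance for x, y in zip(a, b)) / len(a)


def dialectal_consistency(scores_by_variety, reference):
    """Mean agreement between the reference variety and every other variety."""
    ref = scores_by_variety[reference]
    others = [s for name, s in scores_by_variety.items() if name != reference]
    return mean(pairwise_agreement(ref, other) for other in others)


# Hypothetical LLM toxicity ratings (assumed 1-5 scale) for the same four
# statements rendered in a standard variety and two dialectal varieties.
llm_scores = {
    "standard": [1, 4, 2, 5],
    "variety_a": [1, 4, 3, 5],
    "variety_b": [2, 4, 2, 5],
}
# Hypothetical human ratings for the standard-variety statements.
human_scores = [1, 3, 3, 5]

dialect_score = dialectal_consistency(llm_scores, "standard")
human_score = pairwise_agreement(llm_scores["standard"], human_scores)
print(f"dialectal consistency: {dialect_score:.2f}")
print(f"LLM-human agreement:   {human_score:.2f}")
```

Exact-match agreement is used here only because it is easy to read. A chance-corrected statistic such as Cohen's kappa, or a nonzero `tolerance`, may be a better fit for ordinal toxicity ratings.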