CL · HC · Feb 11, 2024

Low-Resource Counterspeech Generation for Indic Languages: The Case of Bengali and Hindi

Mithun Das, Saurabh Kumar Pandey, Shivansh Sethi, Punyajoy Saha, Animesh Mukherjee

arXiv:2402.07262v1 · 26.7106 citations · h-index: 14 · Has Code · Findings

Originality: Synthesis-oriented

AI Analysis

This work addresses the lack of counterspeech generation resources for two low-resource languages, Bengali and Hindi, an incremental step toward combating online abuse in non-English contexts.

The paper tackles counterspeech generation for two low-resource Indic languages, Bengali and Hindi, by creating a benchmark dataset of 5,062 abusive speech/counterspeech pairs and implementing baseline models. It finds that monolingual setups perform best and that transfer works better between languages of the same language family.

With the rise of online abuse, the NLP community has begun investigating the use of neural architectures to generate counterspeech that can "counter" the vicious tone of such abusive speech and dilute/ameliorate its rippling effect over the social network. However, most of the efforts so far have focused primarily on English. To bridge the gap for low-resource languages such as Bengali and Hindi, we create a benchmark dataset of 5,062 abusive speech/counterspeech pairs, of which 2,460 pairs are in Bengali and 2,602 pairs are in Hindi. To set up an effective benchmark, we implement several baseline models that consider various interlingual transfer mechanisms with different configurations to generate suitable counterspeech. We observe that the monolingual setup yields the best performance. Further, using synthetic transfer, language models can generate counterspeech to some extent; specifically, we notice that transferability is better when languages belong to the same language family.

