CL, LG · Oct 3, 2023

Harnessing Pre-Trained Sentence Transformers for Offensive Language Detection in Indian Languages

arXiv:2310.02249v1 · 0.96 citations · h-index: 4

Originality: Synthesis-oriented

AI Analysis

This work addresses hate speech proliferation on social media for Indian language users, but it is incremental as it applies existing methods to new data.

The paper tackles hate speech detection in three low-resource Indian languages (Bengali, Assamese, and Gujarati) by fine-tuning pre-trained BERT and SBERT models on the HASOC 2023 datasets. It achieved the highest ranking in Bengali and notes room for improvement in Assamese and Gujarati.

In our increasingly interconnected digital world, social media platforms have emerged as powerful channels for the dissemination of hate speech and offensive content. This work delves into the domain of hate speech detection, placing specific emphasis on three low-resource Indian languages: Bengali, Assamese, and Gujarati. The challenge is framed as a text classification task, aimed at discerning whether a tweet contains offensive or non-offensive content. Leveraging the HASOC 2023 datasets, we fine-tuned pre-trained BERT and SBERT models to evaluate their effectiveness in identifying hate speech. Our findings underscore the superiority of monolingual sentence-BERT models, particularly in the Bengali language, where we achieved the highest ranking. However, the performance in Assamese and Gujarati languages signifies ongoing opportunities for enhancement. Our goal is to foster inclusive online spaces by countering hate speech proliferation.
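The abstract frames the task as binary text classification on top of sentence-level representations. The sketch below is illustrative only and is not the authors' code. It assumes mean pooling over token vectors (a pooling strategy commonly used by SBERT-style models) and swaps the full fine-tuning step for a small logistic-regression head trained by gradient descent. The token vectors, labels, and all function names are hypothetical toy values, not HASOC 2023 data or outputs of a real encoder.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, so padding is ignored."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        return [0.0] * dim
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_binary_head(embeddings, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression head with plain batch gradient descent."""
    dim = len(embeddings[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(embeddings)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(embedding, weights, bias):
    """Return 1 (offensive) or 0 (non-offensive) at a 0.5 threshold."""
    p = sigmoid(sum(w * xi for w, xi in zip(weights, embedding)) + bias)
    return 1 if p >= 0.5 else 0

# Hypothetical toy "tweets": (token vectors, attention mask, label).
# A mask value of 0 marks a padding position that pooling must skip.
toy_tweets = [
    ([[1.0, 0.2], [0.8, 0.0], [9.0, 9.0]], [1, 1, 0], 1),
    ([[0.9, -0.1], [1.1, 0.1]], [1, 1], 1),
    ([[-1.0, 0.1], [-0.8, 0.3], [9.0, 9.0]], [1, 1, 0], 0),
    ([[-1.2, 0.0], [-0.9, -0.2]], [1, 1], 0),
]

embeddings = [mean_pool(tokens, mask) for tokens, mask, _ in toy_tweets]
labels = [label for _, _, label in toy_tweets]
weights, bias = train_binary_head(embeddings, labels)
predictions = [predict(e, weights, bias) for e in embeddings]
```

In a real pipeline the token vectors would come from a pre-trained encoder, and the encoder weights would be updated together with the classification head rather than held fixed as they effectively are here.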
