CL · Nov 12, 2023

Automatic Textual Normalization for Hate Speech Detection

Anh Thi-Hoang Nguyen, Dung Ha Nguyen, Nguyet Thi Nguyen, Khanh Thanh-Duy Ho, Kiet Van Nguyen

arXiv:2311.06851v40.91 citations · h-index: 3 · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the problem that irregular text in Vietnamese social media poses for NLP tools, but the advance is incremental: it builds on existing normalization methods and replaces them with a simpler approach.

The paper tackles the problem of non-standard words in Vietnamese social media text by proposing a sequence-to-sequence model for textual normalization, which improves hate speech detection accuracy by about 2% despite achieving under 70% normalization accuracy.

Social media data is a valuable resource for research, yet it contains a wide range of non-standard words (NSW). These irregularities hinder the effective operation of NLP tools. Current state-of-the-art methods for the Vietnamese language address this issue as a problem of lexical normalization, involving the creation of manual rules or the implementation of multi-staged deep learning frameworks, which necessitate extensive efforts to craft intricate rules. In contrast, our approach is straightforward, employing solely a sequence-to-sequence (Seq2Seq) model. In this research, we provide a dataset for textual normalization, comprising 2,181 human-annotated comments with an inter-annotator agreement of 0.9014. By leveraging the Seq2Seq model for textual normalization, our results reveal that the accuracy achieved falls slightly short of 70%. Nevertheless, textual normalization enhances the accuracy of the Hate Speech Detection (HSD) task by approximately 2%, demonstrating its potential to improve the performance of complex NLP tasks. Our dataset is accessible for research purposes.
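The abstract does not say how normalization accuracy is computed or how the normalizer feeds the HSD model, so the following is only a minimal sketch under stated assumptions. It assumes sentence-level exact match as the accuracy measure and uses a hypothetical lookup table (`TOY_NSW_TABLE`) as a stand-in for the trained Seq2Seq model. All function names and the example abbreviations are illustrative and are not taken from the paper or its dataset.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a trained Seq2Seq normalizer: a tiny lookup table
# of common Vietnamese chat abbreviations (illustrative, not from the paper).
TOY_NSW_TABLE: Dict[str, str] = {"ko": "không", "dc": "được"}


def toy_normalize(comment: str) -> str:
    """Replace each whitespace-separated token found in the toy table."""
    return " ".join(TOY_NSW_TABLE.get(tok, tok) for tok in comment.split())


def exact_match_accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of comments whose normalized form equals the reference exactly.

    Assumption: sentence-level exact match; the paper may use another metric.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


def normalize_then_classify(
    comment: str,
    normalizer: Callable[[str], str],
    classifier: Callable[[str], str],
) -> str:
    """Run normalization as a preprocessing step before any HSD classifier."""
    return classifier(normalizer(comment))


if __name__ == "__main__":
    raw = ["tôi ko đi dc", "ok"]
    gold = ["tôi không đi được", "đồng ý"]
    predicted = [toy_normalize(c) for c in raw]
    print(predicted)
    print(exact_match_accuracy(predicted, gold))
```

In this arrangement the classifier never sees the raw comment, so any gain on the downstream task can be attributed to the normalization step by comparing runs with and without it.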

