CL, Feb 12, 2024
Text Detoxification as Style Transfer in English and Hindi
Sourabrata Mukherjee, Akanksha Bansal, Atul Kr. Ojha et al.
This paper focuses on text detoxification, i.e., automatically converting toxic text into non-toxic text. This task contributes to safer and more respectful online communication and can be considered a Text Style Transfer (TST) task, where the text style changes while its content is preserved. We present three approaches: knowledge transfer from a similar task; a multi-task learning approach that combines sequence-to-sequence modeling with various toxicity classification tasks; and a delete-and-reconstruct approach. To support our research, we use a dataset provided by Dementieva et al. (2021), which contains multiple detoxified versions of each toxic text. For our experiments, expert human annotators selected the best variants, creating a dataset in which each toxic sentence is paired with a single, appropriate detoxified version. Additionally, we introduce a small Hindi parallel dataset, aligned with a part of the English dataset and suitable for evaluation purposes. Our results demonstrate that our approach effectively balances text detoxification with preserving the actual content and maintaining fluency.
CL, Mar 16, 2020
Developing a Multilingual Annotated Corpus of Misogyny and Aggression
Shiladitya Bhattacharya, Siddharth Singh, Ritesh Kumar et al.
In this paper, we discuss the development of a multilingual annotated corpus of misogyny and aggression in Indian English, Hindi, and Indian Bangla as part of a project on studying and automatically identifying misogyny and communalism on social media (the ComMA Project). The dataset is collected from comments on YouTube videos and currently contains over 20,000 comments. The comments are annotated at two levels: aggression (overtly aggressive, covertly aggressive, and non-aggressive) and misogyny (gendered and non-gendered). We describe the process of data collection, the tagset used for annotation, and the issues and challenges faced during annotation. Finally, we discuss the results of the baseline experiments conducted to develop a classifier for misogyny in the three languages.
IR, Jun 7, 2014
Bullseye: Structured Passage Retrieval and Document Highlighting for Scholarly Search
Xi Zheng, Akanksha Bansal, Matthew Lease
We present the Bullseye system for scholarly search. Given a collection of research papers, Bullseye: 1) identifies relevant passages using any off-the-shelf algorithm; 2) automatically detects document structure and restricts retrieved passages to user-specified sections; and 3) highlights those passages for each PDF document retrieved. We evaluate Bullseye with regard to three aspects: system effectiveness, user effectiveness, and user effort. In a system-blind evaluation, users were asked to compare passage retrieval using Bullseye vs. a baseline that ignores document structure, with regard to four types of graded assessments. Results show modest improvement in system effectiveness, while both user effectiveness and user effort show substantial improvement. Users also report very strong demand for passage highlighting in scholarly search across both systems considered.