CL · Mar 4, 2024
OffensiveLang: A Community Based Implicit Offensive Language Dataset
Amit Das, Mostafa Rahgouy, Dongji Feng et al.
The widespread presence of hateful language on social media has had adverse effects on societal well-being, so addressing it has become a high priority. Hate speech and offensive language exist in both explicit and implicit forms, with the latter being more challenging to detect. Current research in this domain faces two main challenges. First, existing datasets rely primarily on collecting texts that contain explicit offensive keywords, which makes it difficult to capture implicitly offensive content that lacks these keywords. Second, common methodologies tend to focus solely on textual analysis, neglecting the valuable insights that community information can provide. In this paper, we introduce OffensiveLang, a novel community-based implicit offensive language dataset generated by ChatGPT 3.5 and containing data for 38 different target groups. Although ethical constraints limit ChatGPT's ability to generate offensive text, we present a prompt-based approach that effectively generates implicit offensive language. To ensure data quality, we evaluate the dataset with human annotators. Additionally, we employ a prompt-based zero-shot method with ChatGPT and compare the detection results of human annotation and ChatGPT annotation. We also use existing state-of-the-art models to see how effective they are at detecting such language. The dataset is available here: https://github.com/AmitDasRup123/OffensiveLang
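The comparison between human annotation and ChatGPT annotation described above can be illustrated with a chance-corrected agreement score. The abstract does not say which measure the authors use, so Cohen's kappa is an assumption here, and the function name `cohen_kappa` and the toy labels are hypothetical. A minimal sketch:

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels, not taken from the OffensiveLang dataset.
human = ["offensive", "offensive", "not", "not", "offensive", "not"]
model = ["offensive", "not", "not", "not", "offensive", "offensive"]
kappa = cohen_kappa(human, model)
print(f"kappa = {kappa:.3f}")
```

A kappa near 1 would indicate that the model's zero-shot labels track the human labels closely, while a value near 0 would indicate agreement no better than chance.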
CL · Jul 30, 2025
Investigating Hallucination in Conversations for Low Resource Languages
Amit Das, Md. Najib Hasan, Souvika Sarkar et al.
Large Language Models (LLMs) have demonstrated remarkable proficiency in generating text that closely resembles human writing. However, they often generate factually incorrect statements, a problem typically referred to as 'hallucination'. Addressing hallucination is crucial for enhancing the reliability and effectiveness of LLMs. While much research has focused on hallucinations in English, our study extends this investigation to conversational data in three languages: Hindi, Farsi, and Mandarin. We offer a comprehensive analysis of a dataset to examine both factual and linguistic errors in these languages for GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1, and Qwen-3. We find that LLMs produce very few hallucinated responses in Mandarin but generate a significantly higher number of hallucinations in Hindi and Farsi.
SD · Jun 20, 2024
Machine Learning Techniques in Automatic Music Transcription: A Systematic Survey
Fatemeh Jamshidi, Gary Pike, Amit Das et al.
In the domain of Music Information Retrieval (MIR), Automatic Music Transcription (AMT) is a central challenge that aims to convert audio signals into symbolic notations such as musical notes or sheet music. This systematic review highlights the pivotal role of AMT in music signal analysis, which stems from the intricate and overlapping spectral structure of musical harmonies. Through a thorough examination of existing machine learning techniques used in AMT, we explore the progress and constraints of current models and methodologies. Despite notable advancements, AMT systems have yet to match the accuracy of human experts, largely due to the complexities of musical harmonies and the need for nuanced interpretation. This review critically evaluates both fully automatic and semi-automatic AMT systems, emphasizing the importance of minimal user intervention and examining the methodologies proposed to date. By addressing the limitations of prior techniques and suggesting avenues for improvement, we aim to steer future research towards fully automated AMT systems capable of accurately and efficiently translating intricate audio signals into precise symbolic representations. This study not only synthesizes the latest advancements but also lays out a roadmap for overcoming existing challenges in AMT, providing valuable insights for researchers aiming to narrow the gap between current systems and human-level transcription accuracy.
CL · Jun 17, 2024
Investigating Annotator Bias in Large Language Models for Hate Speech Detection
Amit Das, Zheng Zhang, Najib Hasan et al.
Data annotation, the practice of assigning descriptive labels to raw data, is pivotal in optimizing the performance of machine learning models. However, it is a resource-intensive process susceptible to biases introduced by annotators. The emergence of sophisticated Large Language Models (LLMs) presents a unique opportunity to modernize and streamline this complex procedure. While existing research extensively evaluates the efficacy of LLMs as annotators, this paper examines the biases present in LLMs when annotating hate speech data. Our research contributes to understanding biases in four key categories (gender, race, religion, and disability) with four LLMs: GPT-3.5, GPT-4o, Llama-3.1, and Gemma-2. We analyze annotator biases by specifically targeting highly vulnerable groups within these categories. Furthermore, we conduct a comprehensive examination of potential factors contributing to these biases by scrutinizing the annotated data. We introduce our custom hate speech detection dataset, HateBiasNet, to conduct this research. Additionally, we perform the same experiments on the ETHOS dataset (Mollas et al. 2022) for comparative analysis. This paper serves as a crucial resource, guiding researchers and practitioners in harnessing the potential of LLMs for data annotation, thereby fostering advancements in this critical field.
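One simple way to surface the kind of category-level annotator bias discussed above is to measure how often an LLM's label departs from a reference label within each category. The abstract does not describe the authors' actual analysis, so the following is only an illustrative sketch: the function `disagreement_by_group`, the `(category, reference_label, llm_label)` record layout, and the toy records are all hypothetical and are not drawn from HateBiasNet or ETHOS.

```python
from collections import defaultdict


def disagreement_by_group(records):
    """Map each category to the fraction of items where the LLM label
    differs from the reference label.

    `records` is an iterable of (category, reference_label, llm_label) tuples.
    """
    totals = defaultdict(int)
    diffs = defaultdict(int)
    for group, reference_label, llm_label in records:
        totals[group] += 1
        if reference_label != llm_label:
            diffs[group] += 1
    return {group: diffs[group] / totals[group] for group in totals}


# Hypothetical toy records: 1 = hate speech, 0 = not hate speech.
records = [
    ("gender", 1, 1),
    ("gender", 0, 1),
    ("race", 1, 1),
    ("race", 1, 1),
    ("religion", 0, 0),
    ("religion", 1, 0),
    ("disability", 1, 0),
    ("disability", 0, 1),
]
rates = disagreement_by_group(records)
for group, rate in rates.items():
    print(f"{group}: {rate:.2f}")
```

A markedly higher disagreement rate for one category than for the others would be a signal worth investigating further, though it does not by itself establish the direction or cause of the bias.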
HC · Jul 16, 2020
Accessible Computer Science for K-12 Students with Hearing Impairments
Meenakshi Das, Daniela Marghitu, Fatemeh Jamshidi et al.
An inclusive science, technology, engineering and mathematics (STEM) workforce is needed to maintain America's leadership in the scientific enterprise. Increasing the participation of underrepresented groups in STEM, including persons with disabilities, requires national attention to fully engage the nation's citizens in transforming its STEM enterprise. To address this need, a number of initiatives, such as AccessCSforALL, Bootstrap, and CSforAll, are making efforts to make Computer Science inclusive to the 7.4 million K-12 students with disabilities in the U.S. Of special interest to our project are those K-12 students with hearing impairments. American Sign Language (ASL) is the primary means of communication for an estimated 500,000 people in the United States, yet there are limited online resources providing Computer Science instruction in ASL. This paper introduces a new project designed to support Deaf and Hard of Hearing (DHH) K-12 students and sign interpreters in acquiring knowledge of complex Computer Science concepts. We discuss the motivation for the project and an early design of the accessible block-based Computer Science curriculum to engage DHH students in hands-on computing education.