AI CL CY Oct 30, 2025

SynBullying: A Multi LLM Synthetic Conversational Dataset for Cyberbullying Detection

Arefeh Kazemi, Hamza Qadeer, Joachim Wagner, Hossein Hosseini, Sri Balaaji Natarajan Kalaivendan, Brian Davis

arXiv:2511.11599v1 3.3 h-index: 17

Originality: Synthesis-oriented

AI Analysis

SynBullying provides a scalable and ethically safe dataset for researchers and practitioners working on cyberbullying detection, though the contribution is incremental because it builds on existing synthetic data methods.

The paper tackles the problem of cyberbullying detection by introducing SynBullying, a synthetic multi-LLM conversational dataset that simulates realistic bullying interactions, and shows it can be used for training and augmentation in classification tasks.

We introduce SynBullying, a synthetic multi-LLM conversational dataset for studying and detecting cyberbullying (CB). SynBullying provides a scalable and ethically safe alternative to human data collection by leveraging large language models (LLMs) to simulate realistic bullying interactions. The dataset offers (i) conversational structure, capturing multi-turn exchanges rather than isolated posts; (ii) context-aware annotations, where harmfulness is assessed within the conversational flow considering context, intent, and discourse dynamics; and (iii) fine-grained labeling, covering various CB categories for detailed linguistic and behavioral analysis. We evaluate SynBullying across five dimensions, including conversational structure, lexical patterns, sentiment/toxicity, role dynamics, harm intensity, and CB-type distribution. We further examine its utility by testing its performance as standalone training data and as an augmentation source for CB classification.
