CL · AI · Oct 16, 2025

Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective

arXiv:2510.15007v1 · 1 citation · h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses LLM safety concerns by making toxicity evaluation more accurate, which matters to both developers and users. It is an incremental advance because it builds on existing datasets and methods.

The paper tackles the problem of evaluating toxicity in large language models by addressing the limitations of single-label benchmarks, which fail to capture the multi-dimensional nature of toxic content, leading to biased detections. It introduces three multi-label benchmarks and a pseudo-label-based method, showing significant performance improvements over advanced baselines like GPT-4o and DeepSeek.

Large language models (LLMs) have achieved impressive results across a range of natural language processing tasks, but their potential to generate harmful content has raised serious safety concerns. Current toxicity detectors primarily rely on single-label benchmarks, which cannot adequately capture the inherently ambiguous and multi-dimensional nature of real-world toxic prompts. This limitation results in biased evaluations, including missed toxic detections and false positives, undermining the reliability of existing detectors. Additionally, gathering comprehensive multi-label annotations across fine-grained toxicity categories is prohibitively costly, further hindering effective evaluation and development. To tackle these issues, we introduce three novel multi-label benchmarks for toxicity detection: Q-A-MLL, R-A-MLL, and H-X-MLL, derived from public toxicity datasets and annotated according to a detailed 15-category taxonomy. We further provide a theoretical proof that, on our released datasets, training with pseudo-labels yields better performance than directly learning from single-label supervision. In addition, we develop a pseudo-label-based toxicity detection method. Extensive experimental results show that our approach significantly surpasses advanced baselines, including GPT-4o and DeepSeek, thus enabling more accurate and reliable evaluation of multi-label toxicity in LLM-generated content.
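
The evaluation bias described in the abstract can be illustrated with a small sketch. This is not the paper's method or data: the three category names, the example annotations, and the helper functions `to_multi_hot` and `count_errors` are all assumptions made up for illustration (the paper's 15-category taxonomy is not listed here). The sketch scores one detector output against a single-label reference and against a multi-label reference for the same prompt.

```python
# Hypothetical category names; the paper's actual 15-category taxonomy is not given here.
CATEGORIES = ["insult", "threat", "hate"]


def to_multi_hot(labels, categories=CATEGORIES):
    """Encode a set of category names as a 0/1 vector over `categories`."""
    return [1 if c in labels else 0 for c in categories]


def count_errors(predicted, reference):
    """Return (false_positives, false_negatives) between two multi-hot vectors."""
    fp = sum(1 for p, r in zip(predicted, reference) if p == 1 and r == 0)
    fn = sum(1 for p, r in zip(predicted, reference) if p == 0 and r == 1)
    return fp, fn


# Invented example: a prompt that is both an insult and a threat.
multi_label_truth = to_multi_hot({"insult", "threat"})
# A single-label benchmark records only one of the applicable categories.
single_label_truth = to_multi_hot({"insult"})

# A detector that correctly flags both categories.
prediction = to_multi_hot({"insult", "threat"})

errors_vs_single = count_errors(prediction, single_label_truth)
errors_vs_multi = count_errors(prediction, multi_label_truth)

print("vs single-label reference (FP, FN):", errors_vs_single)
print("vs multi-label reference  (FP, FN):", errors_vs_multi)
```

Under these assumptions, the correct "threat" prediction is counted as a false positive against the single-label reference but not against the multi-label one, which is the kind of spurious penalty the abstract attributes to single-label benchmarks.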
