CL · AI · LG · Dec 24, 2025

Semi-Supervised Learning for Large Language Models Safety and Content Moderation

arXiv:2512.21107v1
h-index: 20
Originality: Incremental advance
AI Analysis

This work addresses the challenge of acquiring large labeled datasets for LLM safety and content moderation, offering a more efficient method for researchers and practitioners, though it is incremental as it builds on existing semi-supervised techniques.

The paper tackles the problem of training safety classifiers for Large Language Models by proposing a semi-supervised learning approach that leverages both labeled and unlabeled data, demonstrating that task-specific augmentations significantly improve performance compared to general-purpose techniques.

Safety for Large Language Models (LLMs) has been a research focus since their emergence and has become more pressing as model capabilities grow. Several guardrails are currently in place for all public LLMs, and multiple datasets have been proposed for training safety classifiers. However, training these classifiers relies on large quantities of labeled data, which can be difficult to acquire, is prone to labeling errors, and often includes synthetic data. To address these issues, we suggest a different approach: using semi-supervised learning techniques, which leverage both labeled and unlabeled data, to improve performance on the safety task. We analyze the improvements these techniques offer for both the prompts given to LLMs and the responses to those prompts. Moreover, since augmentation is central to semi-supervised algorithms, we demonstrate the importance of task-specific augmentations, which significantly increase performance compared with general-purpose augmentation techniques.
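The abstract does not name a specific algorithm, so the following is only a minimal sketch of one common semi-supervised recipe (FixMatch-style pseudo-labeling with a confidence threshold), not the authors' method. It assumes a binary safe/unsafe classifier that has already produced logits for a labeled batch and for two augmented views (a weak one and a strong one) of each unlabeled example. The function names, the 0.95 threshold, and the loss weight are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the target class."""
    return -math.log(max(probs[label], 1e-12))


def semi_supervised_loss(labeled_logits, labels, weak_logits, strong_logits,
                         threshold=0.95, unlabeled_weight=1.0):
    """Return (total, supervised, unsupervised) losses.

    Assumption: weak_logits[i] and strong_logits[i] are the classifier's
    outputs for two augmented views of the same unlabeled text.
    """
    # Supervised term: ordinary cross-entropy on the labeled batch.
    sup = sum(
        cross_entropy(softmax(l), y) for l, y in zip(labeled_logits, labels)
    ) / len(labels)

    # Unsupervised term: the weak view supplies a pseudo-label, but only
    # when the prediction is confident; the strong view must then agree.
    unsup_total = 0.0
    for weak, strong in zip(weak_logits, strong_logits):
        probs = softmax(weak)
        confidence = max(probs)
        if confidence >= threshold:
            pseudo_label = probs.index(confidence)
            unsup_total += cross_entropy(softmax(strong), pseudo_label)
    # Average over the whole unlabeled batch, including masked-out examples.
    unsup = unsup_total / len(weak_logits) if weak_logits else 0.0

    return sup + unlabeled_weight * unsup, sup, unsup


if __name__ == "__main__":
    total, sup, unsup = semi_supervised_loss(
        labeled_logits=[[2.0, 0.0]],
        labels=[0],
        weak_logits=[[5.0, 0.0], [0.1, 0.0]],    # second example is not confident
        strong_logits=[[0.0, 0.0], [3.0, 0.0]],
    )
    print(f"total={total:.4f} supervised={sup:.4f} unsupervised={unsup:.4f}")
```

In a recipe like this, the choice of augmentation determines what the weak and strong views look like, which is where a task-specific augmentation would replace a general-purpose one.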
