CL | Feb 26, 2024

ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors

arXiv:2402.16444v2 | 61 citations | h-index: 25 | Has Code | EMNLP
Originality: Incremental advance
AI Analysis

This work addresses the need for aligned, customizable, and explainable safety detection in LLMs, offering a tool for researchers and practitioners, though it is incremental as it builds on existing LLM-based methods.

The authors tackled the problem of detecting safety issues in Large Language Models' responses by proposing ShieldLM, an LLM-based safety detector that aligns with standards, supports customization, and provides explanations, achieving superior performance over baselines across four test sets.

The safety of Large Language Models (LLMs) has gained increasing attention in recent years, but a comprehensive approach for detecting safety issues within LLMs' responses in an aligned, customizable and explainable manner is still lacking. In this paper, we propose ShieldLM, an LLM-based safety detector, which aligns with common safety standards, supports customizable detection rules, and provides explanations for its decisions. To train ShieldLM, we compile a large bilingual dataset comprising 14,387 query-response pairs, annotating the safety of responses based on various safety standards. Through extensive experiments, we demonstrate that ShieldLM surpasses strong baselines across four test sets, showcasing remarkable customizability and explainability. Besides performing well on standard detection datasets, ShieldLM has also been shown to be effective as a safety evaluator for advanced LLMs. ShieldLM is released at https://github.com/thu-coai/ShieldLM to support accurate and explainable safety detection under various safety standards.
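To make the detector's interface concrete, here is a minimal sketch of how a query-response pair and optional custom rules might be packed into a prompt, and how a label plus explanation might be read back from the model's reply. The prompt template, the `[Answer]`/`[Analysis]` output markers, the two-label set, and the names `build_prompt`, `parse_verdict`, and `Verdict` are all assumptions made for illustration; they are not taken from the ShieldLM repository, which should be consulted for the actual format.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Verdict:
    label: str        # "safe" or "unsafe" (hypothetical label set)
    explanation: str  # free-text justification returned by the detector


def build_prompt(query: str, response: str, rules: Sequence[str] = ()) -> str:
    """Assemble a detection prompt. The template is illustrative only."""
    parts = [
        "Judge whether the response below is safe.",
        "Reply with '[Answer] safe' or '[Answer] unsafe', then '[Analysis] <why>'.",
    ]
    if rules:
        # Custom rules are appended as a numbered list so they can be cited.
        parts.append("Apply these custom rules:")
        parts.extend(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    parts.append(f"Query: {query}")
    parts.append(f"Response: {response}")
    return "\n".join(parts)


# Assumed output shape: "[Answer] <label>" followed by "[Analysis] <text>".
_PATTERN = re.compile(
    r"\[Answer\]\s*(safe|unsafe)\b.*?\[Analysis\]\s*(.*)",
    re.IGNORECASE | re.DOTALL,
)


def parse_verdict(text: str) -> Optional[Verdict]:
    """Extract the label and explanation, or None if the reply is malformed."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    return Verdict(label=match.group(1).lower(), explanation=match.group(2).strip())
```

The prompt returned by `build_prompt` would be sent to whichever detector model is in use, and its raw text reply passed to `parse_verdict`. Returning `None` on a malformed reply keeps parsing failures distinct from a genuine "safe" verdict.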

Code Implementations: 1 repo