CL · Aug 28, 2024

Legilimens: Practical and Unified Content Moderation for Large Language Model Services

arXiv:2408.15488v2 · 18 citations · h-index: 10
AI Analysis

This paper addresses a crucial safety concern for LLM service providers by improving moderation effectiveness and efficiency, though it appears incremental, since it builds on existing methods with enhancements such as data augmentation and theoretical analysis.

The paper tackles unsafe content generation by large language models (LLMs) with Legilimens, a practical and unified content moderation framework designed to be both effective and efficient. Extensive experiments show that it outperforms commercial and academic baselines.

Given the societal impact of unsafe content generated by large language models (LLMs), ensuring that LLM services comply with safety standards is a crucial concern for LLM service providers. Common content moderation methods are limited by an effectiveness-and-efficiency dilemma, where simple models are fragile while sophisticated models consume excessive computational resources. In this paper, we reveal for the first time that effective and efficient content moderation can be achieved by extracting conceptual features from chat-oriented LLMs, despite their initial fine-tuning for conversation rather than content moderation. We propose a practical and unified content moderation framework for LLM services, named Legilimens, which features both effectiveness and efficiency. Our red-team model-based data augmentation enhances the robustness of Legilimens against state-of-the-art jailbreaking. Additionally, we develop a framework to theoretically analyze the cost-effectiveness of Legilimens compared to other methods. We have conducted extensive experiments on five host LLMs, seventeen datasets, and nine jailbreaking methods to verify the effectiveness, efficiency, and robustness of Legilimens against normal and adaptive adversaries. A comparison of Legilimens with both commercial and academic baselines demonstrates the superior performance of Legilimens. Furthermore, we confirm that Legilimens can be applied to few-shot scenarios and extended to multi-label classification tasks.
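The central idea in the abstract, reusing features the host LLM already computes and putting a small classifier on top, can be illustrated with a minimal sketch. This is not the authors' implementation: the feature vectors below are synthetic stand-ins for hidden states that a host LLM would produce, the logistic-regression probe is an assumed choice of lightweight classifier, and every name (`train_probe`, `predict_unsafe`, `synthetic_features`) is hypothetical.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe on fixed (frozen) feature vectors."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Prediction error for one example under the current weights.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        # Full-batch gradient descent step on the mean log-loss.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_unsafe(w, b, x, threshold=0.5):
    """Return True when the probe scores the feature vector as unsafe."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= threshold


def synthetic_features(n, dim, shift, rng):
    # ASSUMPTION: stand-in for host-LLM hidden states; a real system would
    # read these from the model during its normal forward pass.
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
safe = synthetic_features(100, 8, -1.0, rng)
unsafe = synthetic_features(100, 8, 1.0, rng)
X = safe + unsafe
y = [0] * len(safe) + [1] * len(unsafe)

w, b = train_probe(X, y)
accuracy = sum(predict_unsafe(w, b, x) == bool(t) for x, t in zip(X, y)) / len(X)
print(f"training accuracy on synthetic features: {accuracy:.2f}")
```

In this sketch the only trained parameters are the probe's weights, which is why such a design can be cheap relative to running a separate moderation model; how Legilimens actually selects and classifies its features is described in the paper itself.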

Code Implementations: 1 repo
