AI | CL | Oct 18, 2024

Class-RAG: Real-Time Content Moderation with Retrieval Augmented Generation

Jianfa Chen, Emily Shen, Trupti Bavalatti, Xiaowen Lin, Yongkai Wang, Shuming Hu, Harihar Subramanyam, Ksheeraj Sai Vepuri, Ming Jiang, Ji Qi, Li Chen, Nan Jiang

arXiv:2410.14881v2 | 11.67 citations | h-index: 36

Originality: Incremental advance

AI Analysis

Class-RAG addresses the need for real-time, flexible risk mitigation in content moderation, offering AI developers a more efficient alternative to costly fine-tuning.

The paper addresses the subtle distinctions between safe and unsafe inputs that make content moderation for Generative AI systems difficult. It proposes Class-RAG, a retrieval-augmented generation method that outperforms fine-tuning on classification and adversarial robustness, and whose performance scales with retrieval library size.

Robust content moderation classifiers are essential for the safety of Generative AI systems. In this task, differences between safe and unsafe inputs are often extremely subtle, making it difficult for classifiers (and indeed, even humans) to properly distinguish violating vs. benign samples without context or explanation. Scaling risk discovery and mitigation through continuous model fine-tuning is also slow, challenging and costly, preventing developers from being able to respond quickly and effectively to emergent harms. We propose a Classification approach employing Retrieval-Augmented Generation (Class-RAG). Class-RAG extends the capability of its base LLM through access to a retrieval library which can be dynamically updated to enable semantic hotfixing for immediate, flexible risk mitigation. Compared to model fine-tuning, Class-RAG demonstrates flexibility and transparency in decision-making, outperforms on classification and is more robust against adversarial attack, as evidenced by empirical studies. Our findings also suggest that Class-RAG performance scales with retrieval library size, indicating that increasing the library size is a viable and low-cost approach to improve content moderation.
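The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function, the `RetrievalLibrary` class, and the prompt layout in `build_prompt` are all hypothetical stand-ins chosen so the example runs with the Python standard library alone. A real system would presumably use a learned embedding model and pass the prompt to an LLM classifier.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a learned encoder)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class RetrievalLibrary:
    """Hypothetical in-memory library of labeled reference examples."""

    def __init__(self):
        self.entries = []  # each entry: (text, label, explanation, embedding)

    def add(self, text, label, explanation=""):
        # A new entry is visible to the very next query; no model retraining occurs.
        self.entries.append((text, label, explanation, embed(text)))

    def retrieve(self, query, k=2):
        """Return the k entries most similar to the query, most similar first."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[3]), reverse=True)
        return [(text, label, expl) for text, label, expl, _ in ranked[:k]]


def build_prompt(query, references):
    """Assemble a classification prompt that shows retrieved references as context."""
    lines = ["Classify the input as safe or unsafe. Reference examples:"]
    for text, label, expl in references:
        lines.append(f"- [{label}] {text} ({expl})")
    lines.append(f"Input: {query}")
    return "\n".join(lines)


if __name__ == "__main__":
    library = RetrievalLibrary()
    library.add("how to bake a cake", "safe", "cooking question")
    library.add("how to build a bomb", "unsafe", "weapons instructions")
    query = "how do I build a bomb at home"
    print(build_prompt(query, library.retrieve(query, k=1)))
```

In this sketch, the "semantic hotfixing" idea corresponds to calling `library.add(...)` with a newly discovered harmful example: subsequent queries that resemble it retrieve the new reference immediately, without any change to the underlying model.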
