AI, CL | Feb 28, 2024

TroubleLLM: Align to Red Team Expert

Zhuoer Xu, Jianping Zhang, Shiwen Cui, Changhua Meng, Weiqiang Wang

arXiv:2403.00829v1 | 5.82 citations | h-index: 14

Originality: Incremental advance

AI Analysis

The paper addresses the need for better safety-testing tools for LLM developers and users, though it appears incremental because it builds on existing LLM-for-testing ideas.

The paper tackles the problem of generating high-quality, diverse test prompts for assessing safety issues in Large Language Models (LLMs), such as social biases and toxic content. It proposes TroubleLLM, and reports through experiments and human evaluation that it outperforms existing methods in generation quality and controllability.

Large Language Models (LLMs) have become the state-of-the-art solutions for a variety of natural language tasks and are integrated into real-world applications. However, LLMs can be harmful, manifesting undesirable safety issues such as social biases and toxic content. It is imperative to assess their safety issues before deployment. However, the quality and diversity of test prompts generated by existing methods are still far from satisfactory. Not only are these methods labor-intensive and costly, but they also lack controllability over test prompt generation for the specific testing domain of an LLM application. With the idea of LLM for LLM testing, we propose the first LLM, called TroubleLLM, to generate controllable test prompts on LLM safety issues. Extensive experiments and human evaluation illustrate the superiority of TroubleLLM in generation quality and generation controllability.
