LG | Jun 4, 2025

RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming

Xiang Zheng, Xingjun Ma, Wei-Bin Lee, Cong Wang

arXiv:2506.04302v17.11 citations | h-index: 4 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses a reproducibility and standardization problem faced by researchers who develop red teaming methods for finding vulnerabilities in LLMs. Its contribution is incremental, since it builds on existing RFT and benchmark frameworks.

The authors address the lack of a unified benchmark for Reinforcement Fine-Tuning (RFT)-based red teaming of Large Language Models, a gap that hinders reproducibility and stability. They introduce RedRFT, a lightweight benchmark that standardizes implementation and evaluation, and support it with an ablation study on key components such as LoRA and KL divergence.

Red teaming has proven to be an effective method for identifying and mitigating vulnerabilities in Large Language Models (LLMs). Among existing red teaming techniques, Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy. However, the lack of a unified benchmark hinders current RFT-based red teaming methods. Implementation details, especially in Proximal Policy Optimization (PPO)-based RFT, significantly affect outcome stability and reproducibility. To address this issue, we introduce RedRFT, a lightweight benchmark designed to simplify and standardize the implementation and evaluation of RFT-based red teaming. RedRFT combines the design strengths of the single-file CleanRL and the highly modularized Tianshou, offering high-quality single-file red teaming implementations and modular PPO core components, such as the Generalized Advantage Estimator (GAE). It supports a variety of token and sentence diversity metrics and features modularized intrinsic reward computation that facilitates plug-and-play experimentation. To clarify how key components influence RFT performance, we conducted an extensive ablation study on Low-Rank Adaptation (LoRA), Kullback-Leibler (KL) divergence, and the Lagrange multiplier. We hope this work contributes to 1) gaining a comprehensive understanding of the implementation nuances of RFT-based red teaming algorithms, and 2) enabling rapid prototyping of innovative features for RFT-based red teaming. Code for the benchmark can be accessed at https://github.com/x-zheng16/RedRFT.git.
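The abstract names the advantage estimator as one of the modular PPO core components. As a rough illustration of what such a component computes, here is a minimal sketch of generalized advantage estimation for a single trajectory. It is not taken from the RedRFT code base: the function name `compute_gae`, its parameters, and the default values of `gamma` and `lam` are assumptions made for this example, and it assumes the trajectory has no early termination.

```python
from typing import List, Sequence, Tuple


def compute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    last_value: float,
    gamma: float = 0.99,  # discount factor (assumed default)
    lam: float = 0.95,    # GAE smoothing parameter (assumed default)
) -> Tuple[List[float], List[float]]:
    """Hypothetical helper: return (advantages, returns) for one trajectory.

    rewards[t] and values[t] are the reward and value estimate at step t;
    last_value is the value estimate of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Walk the trajectory backwards, accumulating discounted TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    # Value targets are the advantages added back onto the value estimates.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


if __name__ == "__main__":
    adv, ret = compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], last_value=0.0)
    print(adv, ret)
```

With `lam=0` the estimate reduces to the one-step TD error, and with `lam=1` it becomes the discounted Monte Carlo return minus the value baseline. This bias-variance trade-off is one of the implementation choices that can affect PPO stability.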

