Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming
This work aims to improve automated red teaming for AI safety by increasing the diversity and effectiveness of generated prompts. The contribution is incremental, since it extends an existing method, Rainbow Teaming.
The paper tackles the problem of generating diverse and effective adversarial prompts for automated red teaming. It introduces Ruby Teaming, which adds a memory cache to improve prompt quality and diversity. The resulting prompt archive reaches a 74% attack success rate (ASR), 20% higher than the Rainbow Teaming baseline, and improves on it by 6% on Shannon's Evenness Index (SEI) and 3% on Simpson's Diversity Index (SDI).
We propose Ruby Teaming, a method that improves on Rainbow Teaming by including a memory cache as its third dimension. The memory dimension provides cues to the mutator to yield better-quality prompts, both in terms of attack success rate (ASR) and quality diversity. The prompt archive generated by Ruby Teaming has an ASR of 74%, which is 20% higher than the baseline. In terms of quality diversity, Ruby Teaming outperforms Rainbow Teaming by 6% and 3% on Shannon's Evenness Index (SEI) and Simpson's Diversity Index (SDI), respectively.
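The two diversity metrics named above can be illustrated with a minimal sketch. It assumes the common textbook definitions, SEI = H / ln(S) with Shannon entropy H over S categories, and SDI = 1 - sum(p_i^2); the abstract does not state which variant the authors use, so the formulas, the function names, and the idea of scoring a list of per-prompt category labels are all assumptions made for illustration.

```python
import math
from collections import Counter


def proportions(labels):
    """Return the share of each distinct label in `labels`."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def shannon_evenness(labels, num_categories=None):
    """Shannon's Evenness Index (assumed form): H / ln(S).

    `num_categories` is the number of possible categories S; if omitted,
    the number of categories actually observed is used.
    """
    p = proportions(labels)
    s = num_categories if num_categories is not None else len(p)
    if s <= 1:
        return 0.0  # evenness is undefined for one category; report 0
    h = -sum(pi * math.log(pi) for pi in p)
    return h / math.log(s)


def simpson_diversity(labels):
    """Simpson's Diversity Index (assumed Gini-Simpson form): 1 - sum(p_i^2)."""
    p = proportions(labels)
    return 1.0 - sum(pi * pi for pi in p)


# Hypothetical category labels for the prompts in two archives.
even_archive = ["a", "b", "c", "d"]
skewed_archive = ["a"] * 9 + ["b"]

even_sei = shannon_evenness(even_archive)
even_sdi = simpson_diversity(even_archive)
skewed_sei = shannon_evenness(skewed_archive)
skewed_sdi = simpson_diversity(skewed_archive)
```

Under these definitions, an archive whose prompts are spread uniformly across categories scores an SEI of 1, while an archive dominated by one category scores lower on both indices.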