Large Empirical Case Study: Go-Explore Adapted for AI Red Team Testing
This case study offers practical guidance for red team testing of production LLM agents, though it is an incremental adaptation of existing methods.
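The researchers adapted Go-Explore to test the security of GPT-4o-mini agents. They found that random-seed variance produced an 8x spread in outcomes and that reward shaping consistently harmed performance, either causing exploration collapse in 94% of runs or producing 18 false positives with zero verified attacks.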
Production LLM agents with tool-using capabilities require security testing despite their safety training. We adapt Go-Explore to evaluate GPT-4o-mini across 28 experimental runs spanning six research questions. We find that random-seed variance dominates algorithmic parameters, yielding an 8x spread in outcomes; single-seed comparisons are unreliable, while multi-seed averaging materially reduces variance in our setup. Reward shaping consistently harms performance, causing exploration collapse in 94% of runs or producing 18 false positives with zero verified attacks. In our environment, simple state signatures outperform complex ones. For comprehensive security testing, ensembles provide attack-type diversity, whereas single agents optimize coverage within a given attack type. Overall, these results suggest that seed variance and targeted domain knowledge can outweigh algorithmic sophistication when testing safety-trained models.
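The point about single-seed comparisons can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function name `summarize_seeds`, the per-seed dictionaries, and every number below are hypothetical placeholders chosen only to show the arithmetic, not results from the study.

```python
import statistics


def summarize_seeds(scores_by_seed):
    """Summarize one configuration's outcomes across random seeds.

    scores_by_seed maps a seed to an outcome count (hypothetical metric,
    e.g. number of verified attacks found in that run).
    """
    values = list(scores_by_seed.values())
    lo, hi = min(values), max(values)
    return {
        "mean": statistics.mean(values),
        # Sample standard deviation needs at least two runs.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        # Ratio of best to worst seed; infinite if the worst run found nothing.
        "spread": hi / lo if lo > 0 else float("inf"),
    }


# Hypothetical placeholder outcomes for two configurations, three seeds each.
config_a = {0: 2, 1: 16, 2: 5}
config_b = {0: 9, 1: 3, 2: 8}

summary_a = summarize_seeds(config_a)
summary_b = summarize_seeds(config_b)

# A single-seed comparison gives contradictory answers depending on the seed.
winner_seed0 = "A" if config_a[0] > config_b[0] else "B"
winner_seed1 = "A" if config_a[1] > config_b[1] else "B"

print(summary_a["spread"], winner_seed0, winner_seed1)
```

With these placeholder values, configuration A shows an 8x spread between its best and worst seed, and the apparent winner flips between seed 0 and seed 1. Comparing the per-configuration means instead gives a single, more stable answer.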