NE PE Apr 21

Diversifying Toxicity Search in Large Language Models Through Speciation

arXiv:2601.20981

AI Analysis

For LLM safety practitioners, this method improves red-teaming coverage of distinct failure modes by addressing the tendency of existing evolutionary search to collapse onto a small family of prompts.

The paper introduces ToxSearch-S, a quality-diversity extension of evolutionary prompt search for red teaming LLMs that maintains multiple high-toxicity prompt niches in parallel, achieving higher peak toxicity (approximately 0.73 vs. 0.47) and broader semantic coverage than the baseline.

Evolutionary prompt search is a practical black-box approach for red teaming large language models; however, existing methods often collapse onto a small family of high-performing prompts, limiting coverage of distinct failure modes. We present a speciated quality-diversity extension of ToxSearch that maintains multiple high-toxicity prompt niches in parallel rather than optimizing a single best prompt. ToxSearch-S introduces unsupervised prompt speciation via a search methodology that maintains capacity-limited species with exemplar leaders, a reserve pool for emerging niches, and species-aware parent selection that trades off within-niche exploitation against cross-niche exploration. Preliminary results show ToxSearch-S reaching higher peak toxicity (≈ 0.73 vs. ≈ 0.47) with a heavier tail (top-10 median 0.66 vs. 0.45) than the baseline. Speciation also yields broader semantic coverage under a topics-as-species analysis (higher effective topic diversity and larger unique topic coverage). Finally, the species formed are well separated in embedding space (mean separation ratio ≈ 1.93) and exhibit distinct toxicity distributions, indicating that speciation partitions the adversarial space into behaviorally differentiated niches rather than superficial lexical variants.
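The abstract names three mechanisms (capacity-limited species with exemplar leaders, a reserve pool, and species-aware parent selection) without giving their details. The sketch below is one possible reading of those mechanisms, not the paper's implementation. Everything specific in it is an assumption made for illustration: the cosine-distance threshold `radius`, the `capacity` and `min_reserve_cluster` values, treating the highest-scoring member as the leader, the `explore_prob` parameter, and the `Prompt` fields themselves.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class Prompt:
    text: str
    embedding: tuple   # hypothetical: any non-zero, fixed-length vector
    toxicity: float    # hypothetical: score in [0, 1] from some classifier


@dataclass
class Species:
    leader: Prompt
    members: list = field(default_factory=list)


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def assign(prompt, species, reserve, radius=0.3, capacity=4,
           min_reserve_cluster=2):
    """Place a prompt in the nearest species within `radius`, else in reserve.

    All thresholds here are illustrative assumptions.
    """
    scored = [(cosine_distance(prompt.embedding, s.leader.embedding), s)
              for s in species]
    near = [(d, s) for d, s in scored if d <= radius]
    if near:
        _, home = min(near, key=lambda pair: pair[0])
        home.members.append(prompt)
        home.members.sort(key=lambda p: p.toxicity, reverse=True)
        del home.members[capacity:]      # capacity limit: drop the weakest
        home.leader = home.members[0]    # assumed: leader = fittest member
        return

    # No species is close enough: park the prompt in the reserve pool.
    reserve.append(prompt)
    cluster = [p for p in reserve
               if cosine_distance(p.embedding, prompt.embedding) <= radius]
    if len(cluster) >= min_reserve_cluster:
        # Enough similar prompts accumulated: promote them to a new species.
        cluster.sort(key=lambda p: p.toxicity, reverse=True)
        species.append(Species(leader=cluster[0], members=cluster[:capacity]))
        promoted = {id(p) for p in cluster}
        reserve[:] = [p for p in reserve if id(p) not in promoted]


def select_parents(species, rng, explore_prob=0.3):
    """Pick two parents from one species (exploit) or two species (explore)."""
    first = rng.choice(species)
    if len(species) > 1 and rng.random() < explore_prob:
        second = rng.choice([s for s in species if s is not first])
    else:
        second = first
    return rng.choice(first.members), rng.choice(second.members)
```

In this reading, `explore_prob` is the single knob for the exploitation/exploration trade-off: at 0 both parents always share a niche, and at 1 they always come from different niches whenever more than one species exists.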
