NE AI CL | Nov 16, 2025

Evolving Prompts for Toxicity Search in Large Language Models

arXiv:2511.12487v1 | 4.2

Originality: Incremental advance

AI Analysis

This work addresses safety vulnerabilities in large language models, which matter to both users and developers. The contribution is incremental because it builds on existing red-teaming methods.

The authors address the continued vulnerability of large language models to adversarial prompts that elicit toxic content. They develop ToxSearch, a black-box evolutionary framework for testing model safety, and show that prompts evolved on one model transfer to others in attenuated form, with toxicity roughly halving on most targets.

Large Language Models remain vulnerable to adversarial prompts that elicit toxic content even after safety alignment. We present ToxSearch, a black-box evolutionary framework that tests model safety by evolving prompts in a synchronous steady-state loop. The system employs a diverse set of operators, including lexical substitutions, negation, back-translation, paraphrasing, and two semantic crossover operators, while a moderation oracle provides fitness guidance. Operator-level analysis shows heterogeneous behavior: lexical substitutions offer the best yield-variance trade-off, semantic-similarity crossover acts as a precise low-throughput inserter, and global rewrites exhibit high variance with elevated refusal costs. Using elite prompts evolved on LLaMA 3.1 8B, we observe practically meaningful but attenuated cross-model transfer, with toxicity roughly halving on most targets, smaller LLaMA 3.2 variants showing the strongest resistance, and some cross-architecture models retaining higher toxicity. These results suggest that small, controllable perturbations are effective vehicles for systematic red-teaming and that defenses should anticipate cross-model reuse of adversarial prompts rather than focusing only on single-model hardening.
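The steady-state loop described in the abstract can be illustrated with a minimal, generic sketch. This is not the authors' implementation: the function names, the synonym table, the binary-tournament selection, the replace-worst rule, and the plain one-point crossover (standing in for the paper's semantic crossover operators) are all assumptions made for illustration. The scoring function `toy_oracle` is a harmless placeholder that rewards lexical diversity; it merely marks where a moderation oracle would supply fitness.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical synonym table, used only for this illustration.
SYNONYMS = {"describe": "explain", "short": "brief", "story": "tale"}


def toy_oracle(prompt: str) -> float:
    """Placeholder fitness in [0, 1]: distinct-word count, capped at 10.

    Assumption: stands in for the moderation oracle's score.
    """
    return min(len(set(prompt.split())), 10) / 10.0


def lexical_substitution(parent: str, other: str, rng: random.Random) -> str:
    """Swap one word for its entry in SYNONYMS, if any word qualifies."""
    words = parent.split()
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    if not candidates:
        return parent
    i = rng.choice(candidates)
    words[i] = SYNONYMS[words[i]]
    return " ".join(words)


def one_point_crossover(parent: str, other: str, rng: random.Random) -> str:
    """Join a prefix of `parent` with a suffix of `other` (word level)."""
    a, b = parent.split(), other.split()
    return " ".join(a[: rng.randint(0, len(a))] + b[rng.randint(0, len(b)):])


def steady_state_search(
    seeds: List[str],
    score: Callable[[str], float],
    operators: List[Callable[[str, str, random.Random], str]],
    steps: int,
    rng: random.Random,
) -> List[Tuple[float, str]]:
    """One child per step; it replaces the worst member only if it scores higher."""
    population = [(score(p), p) for p in seeds]
    for _ in range(steps):
        # Binary tournament: the better of two random members becomes the parent.
        parent = max(rng.sample(population, 2))[1]
        other = rng.choice(population)[1]
        child = rng.choice(operators)(parent, other, rng)
        child_score = score(child)
        worst = min(range(len(population)), key=lambda i: population[i][0])
        if child_score > population[worst][0]:
            population[worst] = (child_score, child)
    return sorted(population, reverse=True)


if __name__ == "__main__":
    seeds = ["describe a short story", "write a poem about rain", "explain how tides work"]
    ops = [lexical_substitution, one_point_crossover]
    for fitness, prompt in steady_state_search(seeds, toy_oracle, ops, 50, random.Random(0)):
        print(f"{fitness:.2f}  {prompt}")
```

Because only the worst member is ever replaced, and only by a strictly better child, the population size stays fixed and the best score never decreases. That property is what distinguishes a steady-state loop from a generational one, where the whole population is replaced at once.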
