LG · AI · CL · Feb 3, 2025

Eliciting Language Model Behaviors with Investigator Agents

arXiv:2502.01236v1 · 22 citations · h-index: 12 · ICML
Originality: Highly original
AI Analysis

This work addresses a known bottleneck in language model safety and reliability: characterizing and controlling the outputs a model can produce. It proposes a novel method for doing so.

The researchers tackle the problem of eliciting specific target behaviors from language models by searching for effective prompts. They report a 100% attack success rate on a subset of AdvBench (Harmful Behaviors) and an 85% hallucination rate.

Language models exhibit complex, diverse behaviors when prompted with free-form text, making it difficult to characterize the space of possible outputs. We study the problem of behavior elicitation, where the goal is to search for prompts that induce specific target behaviors (e.g., hallucinations or harmful responses) from a target language model. To navigate the exponentially large space of possible prompts, we train investigator models to map randomly-chosen target behaviors to a diverse distribution of outputs that elicit them, similar to amortized Bayesian inference. We do this through supervised fine-tuning, reinforcement learning via DPO, and a novel Frank-Wolfe training objective to iteratively discover diverse prompting strategies. Our investigator models surface a variety of effective and human-interpretable prompts leading to jailbreaks, hallucinations, and open-ended aberrant behaviors, obtaining a 100% attack success rate on a subset of AdvBench (Harmful Behaviors) and an 85% hallucination rate.
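
As a rough illustration of the DPO step mentioned in the abstract, the sketch below computes the standard DPO loss for a single preference pair. It is not the paper's implementation. The pairing rule (treating the prompt that better elicits the target behavior as "chosen"), the function name `dpo_loss`, and the default `beta` are assumptions made for this example.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of candidate prompts.

    Assumption for this sketch: "chosen" is the prompt that elicits the
    target behavior more strongly from the target model, "rejected" the one
    that elicits it less strongly. Inputs are log-probabilities of each
    prompt under the investigator (policy) and a frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # prompt over the rejected one, relative to the reference model.
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree, the margin is zero and the loss equals log 2. The loss falls as the investigator shifts probability toward the prompt that better elicits the target behavior.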

