AI LG ML

Jun 15, 2025

ContextBench: Modifying Contexts for Targeted Latent Activation

Robert Graham, Edward Stevinson, Leo Richter, Alexander Chia, Joseph Miller, Joseph Isaac Bloom

arXiv:2506.15735v1

h-index: 2

Originality: Incremental advance

AI Analysis

This work addresses safety concerns in AI by enabling targeted activation of latent features, though it is incremental as it builds on existing methods like Evolutionary Prompt Optimization.

The paper tackles the problem of generating inputs that trigger specific behaviors or latent features in language models, with safety applications in mind. It introduces ContextBench, a benchmark for evaluating such methods, and shows that enhanced Evolutionary Prompt Optimization variants achieve state-of-the-art performance in balancing elicitation effectiveness and fluency.

Identifying inputs that trigger specific behaviours or latent features in language models could have a wide range of safety use cases. We investigate a class of methods capable of generating targeted, linguistically fluent inputs that activate specific latent features or elicit model behaviours. We formalise this approach as context modification and present ContextBench -- a benchmark with tasks assessing core method capabilities and potential safety applications. Our evaluation framework measures both elicitation strength (activation of latent features or behaviours) and linguistic fluency, highlighting how current state-of-the-art methods struggle to balance these objectives. We enhance Evolutionary Prompt Optimisation (EPO) with LLM-assistance and diffusion model inpainting, and demonstrate that these variants achieve state-of-the-art performance in balancing elicitation effectiveness and fluency.
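The trade-off between elicitation strength and fluency described in the abstract can be illustrated with a small Pareto-front selection. This is a minimal sketch and not the authors' implementation: the `Candidate` type, the `elicitation` score (higher is better), the `cross_entropy` fluency proxy (lower is better), and all numeric values are hypothetical, chosen only to show how candidates that are worse on both objectives get filtered out.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    text: str
    elicitation: float    # hypothetical activation score, higher is better
    cross_entropy: float  # hypothetical fluency proxy, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is at least as good as `b` on both objectives and better on one."""
    no_worse = a.elicitation >= b.elicitation and a.cross_entropy <= b.cross_entropy
    strictly_better = a.elicitation > b.elicitation or a.cross_entropy < b.cross_entropy
    return no_worse and strictly_better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep candidates not dominated by any other, ordered from most to least fluent."""
    front = [c for c in cands if not any(dominates(o, c) for o in cands)]
    return sorted(front, key=lambda c: c.cross_entropy)


# Illustrative values only; these are not results from the paper.
candidates = [
    Candidate("fluent but weak", elicitation=0.2, cross_entropy=2.0),
    Candidate("balanced", elicitation=0.6, cross_entropy=3.0),
    Candidate("dominated", elicitation=0.5, cross_entropy=4.0),
    Candidate("strong but garbled", elicitation=0.9, cross_entropy=7.5),
]

front = pareto_front(candidates)
```

Here the "dominated" candidate is dropped because "balanced" scores higher on elicitation and lower on cross-entropy, while the other three remain as distinct points on the trade-off curve.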
