Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs
This work provides an efficient and adaptive method for fine-grained control over LLM behavior at inference time, addressing safety concerns for users and developers. Its contribution is incremental, since it builds on existing activation steering techniques.
The paper tackles the problem of controlling undesirable behaviors in Large Language Models (LLMs), such as generating unsafe content. It introduces a lightweight controller network that dynamically modulates activation steering during inference. Without altering model parameters, the method achieves significant increases in refusal rates on safety benchmarks such as ToxicChat and In-The-Wild Jailbreak Prompts.
Controlling undesirable Large Language Model (LLM) behaviors, such as generating unsafe content or failing to adhere to safety guidelines, often relies on costly fine-tuning. Activation steering provides an alternative for inference-time control, but existing methods typically lack fine-grained, adaptive mechanisms. We introduce a novel approach using a lightweight, trainable controller network integrated during inference. This controller network observes specific intermediate LLM activations and predicts both a global scaling factor and layer-specific weights. These predicted values then dynamically modulate the intensity of a steering patch, derived from a pre-computed "refusal direction" vector, that is applied across the LLM's layers during generation. Trained on activations from both harmful and benign prompts, our controller learns to apply nuanced, layer-aware interventions discriminatively, activating steering primarily for harmful inputs. Experiments on safety benchmarks such as ToxicChat and In-The-Wild Jailbreak Prompts demonstrate that our weighted steering controller significantly increases refusal rates compared to the base LLM, achieving targeted behavioral modification without altering the original model parameters. Our experiments with Llama-3.1-8B, Llama-3.2-1B, and Mistral-7B show that our approach outperforms existing methods, presenting an efficient and adaptive method for fine-grained control over LLM behavior at inference time.
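The mechanism described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not give the controller architecture or the exact patch formula, so everything here is an assumption: the class `SteeringController`, the helper `apply_steering`, the single hidden layer, the sigmoid on the global scale, the softmax over layer weights, and the additive patch `scale * weight * direction` are all hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


class SteeringController:
    """Hypothetical controller: one hidden layer with two output heads.

    Layer sizes and output activations are assumptions, not the paper's
    design. Weights are random here; in the paper they are trained on
    activations from harmful and benign prompts.
    """

    def __init__(self, d_model, n_layers, hidden=8, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.w_hidden = init(hidden, d_model)
        self.w_scale = init(1, hidden)          # head for the global scaling factor
        self.w_layers = init(n_layers, hidden)  # head for the layer-specific weights

    def __call__(self, activation):
        h = [max(0.0, v) for v in matvec(self.w_hidden, activation)]  # ReLU
        scale = sigmoid(matvec(self.w_scale, h)[0])       # assumed range (0, 1)
        layer_weights = softmax(matvec(self.w_layers, h))  # assumed to sum to 1
        return scale, layer_weights


def apply_steering(hidden_states, direction, scale, layer_weights):
    """Add scale * weight_l * direction to the hidden state of each layer l.

    The additive form is an assumption about how the steering patch is applied.
    """
    return [
        [h_i + scale * w * d_i for h_i, d_i in zip(h, direction)]
        for h, w in zip(hidden_states, layer_weights)
    ]


# Toy usage: 3 layers, 4-dimensional hidden states, a made-up direction vector.
controller = SteeringController(d_model=4, n_layers=3)
scale, layer_weights = controller([0.5, -1.0, 0.25, 2.0])
hidden_states = [[0.0] * 4 for _ in range(3)]
refusal_direction = [1.0, 0.0, 0.0, 0.0]
steered = apply_steering(hidden_states, refusal_direction, scale, layer_weights)
```

In this sketch a global scale near zero leaves every layer's hidden state unchanged, which is the behavior a trained controller would be expected to show on benign prompts. In a real model the patch would be applied inside the forward pass, for example through per-layer hooks, rather than on a list of vectors.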