Global Evolutionary Steering: Refining Activation Steering Control via Cross-Layer Consistency
The work addresses the challenge of reliable model alignment for LLM users with a training-free solution, though it appears incremental because it builds on existing activation engineering methods.
The paper tackles noise and semantic drift in activation steering for controlling Large Language Models. It proposes GER-steer, which refines steering vectors using global evolutionary signals and is reported to outperform baselines in both efficacy and generalization.
Activation engineering enables precise control over Large Language Models (LLMs) without the computational cost of fine-tuning. However, existing methods that derive steering vectors from static activation differences are susceptible to high-dimensional noise and layer-wise semantic drift, often capturing spurious correlations rather than the target intent. To address this, we propose Global Evolutionary Refined Steering (GER-steer), a training-free framework that is grounded in the geometric stability of the network's representation evolution. GER-steer exploits this global signal to rectify raw steering vectors, effectively decoupling robust semantic intent from orthogonal artifacts. Extensive evaluations confirm that GER-steer consistently outperforms baselines, delivering superior efficacy and generalization without layer-specific tuning, establishing a universal solution for reliable model alignment.
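The abstract does not spell out the refinement procedure, so the following is only an illustrative sketch of the general idea, not the GER-steer algorithm itself. It assumes that raw steering vectors are computed per layer as a difference of mean activations between positive and negative prompts (the "static activation differences" mentioned above), and that a shared cross-layer direction can stand in for the "global signal". The function names (`diff_of_means`, `global_direction`, `refine`) and the choice of the averaged unit vector as that shared direction are hypothetical.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def diff_of_means(pos_acts, neg_acts):
    """Raw steering vector for one layer: mean(positive) - mean(negative)."""
    mean_pos = mean_vector(pos_acts)
    mean_neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v] if length > 0 else list(v)


def global_direction(raw_vectors):
    """ASSUMPTION: use the normalized mean of the per-layer unit vectors
    as a stand-in for a cross-layer 'global' direction."""
    return unit(mean_vector([unit(v) for v in raw_vectors]))


def refine(raw_vectors):
    """ASSUMPTION: 'rectify' each raw vector by keeping only its component
    along the shared direction and dropping the orthogonal remainder."""
    g = global_direction(raw_vectors)
    return [[dot(v, g) * x for x in g] for v in raw_vectors]


# Toy example: three layers, 3-dimensional activations, two prompts per class.
layers_pos = [
    [[1.0, 0.2, 0.0], [1.2, 0.4, 0.0]],
    [[0.9, -0.3, 0.1], [1.1, -0.1, 0.1]],
    [[1.0, 0.0, 0.3], [1.0, 0.0, 0.5]],
]
layers_neg = [
    [[0.0, 0.0, 0.0], [0.2, 0.0, 0.0]],
    [[0.0, 0.0, 0.1], [0.0, 0.0, 0.1]],
    [[0.1, 0.0, 0.0], [0.1, 0.0, 0.0]],
]
raw = [diff_of_means(p, n) for p, n in zip(layers_pos, layers_neg)]
refined = refine(raw)
```

In this toy setup, each layer's raw vector carries a layer-specific component in a different dimension, while all layers agree on the first dimension. Projecting onto the shared direction suppresses the layer-specific components. In a real model the activations would come from hooks on the residual stream, and the actual method may weight layers or define the global signal quite differently.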