LG · AI · CL · Mar 6

COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics

arXiv:2603.06495v1 · 3 citations
Predicted impact: top 10% in LG · last 90 days
Originality: Highly original
AI Analysis

This work addresses the trade-off between sample efficiency and signal capture in activation steering methods, offering a more efficient way to control LLM behavior for researchers and practitioners working with LLMs.

COLD-Steer is a training-free framework that steers large language models by approximating the representational changes that gradient descent on in-context examples would produce. It achieves up to 95% steering effectiveness while using 50 times fewer samples than the best baseline.

Activation steering methods enable inference-time control of large language model (LLM) behavior without retraining, but current approaches face a fundamental trade-off: sample-efficient methods capture steering signals from labeled examples suboptimally, while methods that extract these signals better require hundreds to thousands of examples. We introduce COLD-Steer, a training-free framework that steers LLM activations by approximating the representational changes that would result from gradient descent on in-context examples. Our key insight is that the effect of fine-tuning on a small set of examples can be efficiently approximated at inference time without actual parameter updates. We formalize this through two complementary approaches: (i) a unit kernel approximation method that updates the activations directly using gradients with respect to them, normalized across examples, and (ii) a finite-difference approximation requiring only two forward passes regardless of example count. Experiments across a variety of steering tasks and benchmarks demonstrate that COLD-Steer achieves up to 95% steering effectiveness while using 50 times fewer samples than the best baseline. COLD-Steer makes it possible to accommodate diverse perspectives without extensive demonstration data, which we validate through experiments on pluralistic alignment tasks. Our framework opens new possibilities for adaptive, context-aware model control that can flexibly address varying loss-driven human preferences through principled approximation of learning dynamics rather than specialized training procedures.
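The two ideas named in the abstract can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it assumes a hypothetical linear readout `w` over a small activation vector `h`, a squared-error loss per example, and plain averaging of gradients across examples. It shows (a) an activation moved against the example-averaged gradient with respect to that activation, and (b) a directional derivative estimated from only two function evaluations, which is the generic principle behind a finite-difference approximation. All names (`steer`, `finite_difference`, `w`, `targets`, `step`, `eps`) are illustrative assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def loss(h, w, y):
    """Toy per-example loss: squared error of the readout w.h against target y."""
    return 0.5 * (dot(w, h) - y) ** 2


def grad_h(h, w, y):
    """Analytic gradient of the toy loss with respect to the activation h."""
    residual = dot(w, h) - y
    return [residual * wi for wi in w]


def mean_grad(h, w, targets):
    """Gradient with respect to h, averaged over all examples (an assumption)."""
    n = len(targets)
    avg = [0.0] * len(h)
    for y in targets:
        avg = [a + g / n for a, g in zip(avg, grad_h(h, w, y))]
    return avg


def mean_loss(h, w, targets):
    return sum(loss(h, w, y) for y in targets) / len(targets)


def steer(h, w, targets, step=0.1):
    """Shift the activation against the averaged gradient; no weights change."""
    g = mean_grad(h, w, targets)
    return [hi - step * gi for hi, gi in zip(h, g)]


def finite_difference(f, h, v, eps=1e-6):
    """Directional derivative of f at h along v from two evaluations of f."""
    shifted = [hi + eps * vi for hi, vi in zip(h, v)]
    return (f(shifted) - f(h)) / eps


# Hypothetical toy values, chosen only for illustration.
h = [1.0, -2.0, 0.5]
w = [0.5, 0.25, -1.0]
targets = [1.0, 2.0, 3.0]

h_steered = steer(h, w, targets)
fd = finite_difference(lambda x: mean_loss(x, w, targets), h, w)
exact = dot(mean_grad(h, w, targets), w)
print(mean_loss(h, w, targets), mean_loss(h_steered, w, targets), fd, exact)
```

In this toy setting the steered activation has a lower mean loss than the original, and the two-evaluation estimate agrees closely with the analytic directional derivative. In a real LLM the loss and gradients would come from the model itself, and the normalization across examples would follow the paper's definition, which the abstract does not spell out.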
