Mar 4, 2025

Effectively Steer LLM To Follow Preference via Building Confident Directions

Amazon
arXiv:2503.02989v1 · 8 citations · h-index: 25
Originality: Incremental advance
AI Analysis

This work addresses the need for cost-effective, controllable alignment methods for LLMs. It offers a novel approach that goes beyond bidirectional steering, though the advance is incremental because it builds on existing model steering techniques.

The paper tackles the problem of aligning LLMs with human preferences. It proposes a theoretical framework and a method called CONFST that steers models by modifying their activations at inference time, and it reports superior performance over competing methods on tasks such as shifting topics and styles with GPT-2 XL, Mistral, and Gemma-it.

Having an LLM that aligns with human preferences is essential for accommodating individual needs, such as maintaining a writing style or generating specific topics of interest. The majority of current alignment methods rely on fine-tuning or prompting, which can be either costly or difficult to control. Model steering algorithms, which modify the model output by constructing specific steering directions, are typically easy to implement and optimization-free. However, their capabilities are usually limited to steering the model in one of two directions (i.e., bidirectional steering), and there has been no theoretical understanding to guarantee their performance. In this work, we propose a theoretical framework to understand and quantify model steering methods. Inspired by the framework, we propose a confident direction steering method (CONFST) that steers LLMs by modifying their activations at inference time. More specifically, CONFST builds a confident direction that is closely aligned with users' preferences, and this direction is then added to the activations of the LLM to effectively steer the model output. Our approach offers three key advantages over popular bidirectional model steering methods: 1) It is more powerful, since multiple (i.e., more than two) users' preferences can be aligned simultaneously; 2) It is simple to implement, since there is no need to determine which layer to add the steering vector to; 3) No explicit user instruction is required. We validate our method on GPT-2 XL (1.5B), Mistral (7B) and Gemma-it (9B) models for tasks that require shifting the output of LLMs across various topics and styles, achieving superior performance over competing methods.
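The abstract describes the mechanism only at a high level: build a direction from activations that confidently reflect a preference, then add it to the model's activations at inference time. The sketch below illustrates that general idea on toy 2-D vectors and is not the authors' implementation. Everything specific in it is an assumption made for illustration: the nearest-centroid softmax used as the confidence score, the `threshold` and `alpha` parameters, the unit-norm scaling, and all function names. The abstract does not say how CONFST measures confidence or scales the direction.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def class_confidences(x, centroids):
    """Softmax over negative squared distances to each class centroid.

    ASSUMPTION: a nearest-centroid scorer stands in for whatever
    classifier the paper actually uses to measure confidence.
    """
    scores = {
        label: -sum((a - b) ** 2 for a, b in zip(x, c))
        for label, c in centroids.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def confident_direction(samples, target, threshold=0.9):
    """Average the target-class activations whose confidence for `target`
    is at least `threshold`, then scale the result to unit length.

    `samples` is a list of (activation_vector, preference_label) pairs.
    """
    by_label = {}
    for vec, label in samples:
        by_label.setdefault(label, []).append(vec)
    centroids = {label: mean_vector(vs) for label, vs in by_label.items()}
    confident = [
        vec for vec in by_label[target]
        if class_confidences(vec, centroids)[target] >= threshold
    ]
    if not confident:
        raise ValueError("no sample meets the confidence threshold")
    direction = mean_vector(confident)
    norm = math.sqrt(sum(v * v for v in direction))
    if norm == 0.0:
        raise ValueError("confident samples average to the zero vector")
    return [v / norm for v in direction]


def steer(activation, direction, alpha=1.0):
    """Add the scaled steering direction to one activation vector."""
    return [h + alpha * d for h, d in zip(activation, direction)]


# Toy activations for three preferences (more than two, as in the abstract).
samples = [
    ([4.0, 0.0], "sports"), ([5.0, 0.0], "sports"),
    ([1.0, 2.0], "sports"),  # ambiguous: nearly as close to "science"
    ([0.0, 4.0], "science"), ([0.0, 5.0], "science"),
    ([-4.0, -4.0], "politics"), ([-5.0, -5.0], "politics"),
]

direction = confident_direction(samples, target="sports", threshold=0.9)
steered = steer([0.5, 0.5], direction, alpha=2.0)
print(direction, steered)
```

In this toy setup the ambiguous "sports" sample falls below the threshold and is excluded, so the direction is built only from the two confidently classified activations. In a real model the vectors would be hidden states captured from the network, and the addition would happen inside the forward pass rather than on standalone lists.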

