LG · May 21

Steered Generation via Gradient-Based Optimization on Sparse Query Features

arXiv:2605.2304081.0

Predicted impact: top 14% in LG · last 90 days

Originality: Incremental advance

AI Analysis

For researchers and practitioners needing fine-grained, interpretable control over LLM outputs, this work provides a method that disentangles semantic features better than dense state interventions.

The paper introduces Prototype-Based Sparse Steering, which applies Sparse Autoencoders to attention query activations and uses gradient-based optimization to steer LLM generation. It achieves effective control over both logical planning (safe vs. short paths in gridworld) and stylistic nuance (cognitive complexity in educational feedback).

Latent steering exploits internal representations of Large Language Models (LLMs) to guide generation, yet interventions on dense states can entangle distinct semantic features. In this paper, we investigate attention query activations as a high-fidelity site for precise control, hypothesizing that manipulating the attention mechanism itself offers sharper steerability than general state interventions. We introduce Prototype-Based Sparse Steering, a framework that applies Sparse Autoencoders (SAEs) specifically to query activations to decompose them into interpretable features, then applies gradient-based optimization during inference to align the sparse representation with class prototypes of target behaviors. To validate this architectural insight, we first analyze the mechanism in Textualized Gridworld, a controlled environment for verifiable planning constraints. We demonstrate that optimizing sparse query features enables effective navigation of rigid planning requirements (i.e., safe vs. short paths), confirming the method's ability to satisfy objective rules. We then demonstrate the framework's versatility by training SAEs on a high-dimensional educational domain, where the framework steers the cognitive complexity of feedback (i.e., Bloom's Taxonomy). Our experiments establish that sparse query representations provide the necessary disentanglement for unified, interpretable control over both logical planning and stylistic nuance.
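The steering loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a ReLU sparse autoencoder with hand-picked weights, a class prototype defined as the mean sparse code of example queries, a squared-distance loss, and a write-back step that adds the decoded change to the original query. All names (`encode`, `class_prototype`, `steer_code`, the weight matrices) are hypothetical and chosen only for illustration.

```python
# Toy sketch of prototype-based steering in a sparse feature space.
# Assumptions (not from the paper): ReLU SAE, mean-code prototypes,
# squared-distance loss, plain gradient descent, hand-picked weights.

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def encode(W_enc, b_enc, q):
    """Sparse code z = ReLU(W_enc q + b_enc) for a query activation q."""
    return [max(0.0, s + b) for s, b in zip(matvec(W_enc, q), b_enc)]

def class_prototype(codes):
    """Prototype of a target behavior: the mean sparse code of its examples."""
    n = len(codes)
    return [sum(col) / n for col in zip(*codes)]

def steer_code(z, prototype, lr=0.1, steps=50):
    """Gradient descent on L(z) = 0.5 * ||z - prototype||^2.

    The gradient is (z - prototype); each step is followed by a projection
    onto z >= 0 so the code stays a valid ReLU output.
    """
    z = list(z)
    for _ in range(steps):
        grad = [zi - pi for zi, pi in zip(z, prototype)]
        z = [max(0.0, zi - lr * gi) for zi, gi in zip(z, grad)]
    return z

# Hypothetical SAE weights: 2-dimensional queries, 3 sparse features.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -0.5], [0.0, 1.0, -0.5]]

# Hypothetical query activations recorded for the target class.
target_queries = [[1.0, 0.2], [0.8, 0.0]]
prototype = class_prototype([encode(W_enc, b_enc, t) for t in target_queries])

# Steer a new query toward the prototype at inference time.
q = [0.1, 1.0]
z0 = encode(W_enc, b_enc, q)
z_steered = steer_code(z0, prototype)

# Write the change back in query space, keeping the SAE's residual error.
delta = [a - b for a, b in zip(z_steered, z0)]
q_steered = [qi + di for qi, di in zip(q, matvec(W_dec, delta))]
```

Adding only the decoded difference, rather than replacing the query with the full reconstruction, is one common design choice for SAE-based interventions because it leaves the part of the activation the SAE fails to reconstruct untouched. Whether the paper uses this write-back, or a different loss or optimizer, is not stated in the abstract.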
