Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing
The work addresses the need for AI systems that adapt to diverse user expectations by letting users request exact attribute intensities. It appears incremental, however, since it builds on existing alignment methods with targeted improvements.
The paper tackles precise attribute intensity control in Large Language Models (LLMs), which current alignment methods do not achieve reliably. In experiments on LLaMA-3.2-3b and Phi-4-mini, the authors show that their method provides fine-grained, continuous control with high accuracy.
Precise attribute intensity control, that is, generating Large Language Model (LLM) outputs with specific, user-defined attribute intensities, is crucial for AI systems that adapt to diverse user expectations. Current LLM alignment methods, however, typically provide only directional or open-ended guidance and fail to reliably achieve exact attribute intensities. We address this limitation with three key designs: (1) reformulating precise attribute intensity control as a target-reaching problem rather than simple maximization; (2) training a lightweight value function via temporal-difference learning to predict final attribute intensity scores from partial generations and thereby steer LLM outputs; and (3) employing gradient-based interventions on hidden representations to navigate the model precisely toward specific attribute intensity targets. Our method enables fine-grained, continuous control over attribute intensities, moving beyond simple directional alignment. Experiments on LLaMA-3.2-3b and Phi-4-mini confirm our method's ability to steer text generation to user-specified attribute intensities with high accuracy. Finally, we demonstrate efficiency gains in three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference. Our code is available at https://github.com/Pre-Control/pre-control
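To make the three designs concrete, the toy sketch below ties them together in plain Python. It is not the authors' implementation (their code is in the linked repository) and rests on stated assumptions: hidden states are small float vectors, the value head is a bias-free linear function V(h) = w . h, and the helper names predict, td0_train and steer_to_target are hypothetical. The value head is fitted with semi-gradient TD(0) on successive prefixes of generations whose final attribute scores are known, so a partial state learns to predict the final score. The intervention then takes gradient steps on a hidden state to minimize (V(h) - target)^2, the target-reaching objective, rather than to maximize V(h).

```python
# A minimal, hypothetical sketch: hidden states are 2-D float lists and the
# value head is bias-free linear, V(h) = w . h. The described method works on
# real LLM hidden representations; nothing here is its actual code.


def predict(w, h):
    """Predicted final attribute intensity for a (partial) hidden state h."""
    return sum(wi * hi for wi, hi in zip(w, h))


def td0_train(trajectories, dim, lr=0.1, epochs=300):
    """Fit the value head with semi-gradient TD(0).

    Each trajectory is (states, final_score), where states are the hidden
    states of successive prefixes of one generation. A non-final state
    bootstraps from V(next state); the final state is regressed onto the
    observed attribute score (reward only at the end, no discounting).
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for states, final_score in trajectories:
            for t, h in enumerate(states):
                if t + 1 < len(states):
                    target = predict(w, states[t + 1])  # bootstrap target
                else:
                    target = final_score  # observed score of finished text
                td_error = target - predict(w, h)
                w = [wi + lr * td_error * hi for wi, hi in zip(w, h)]
    return w


def steer_to_target(h, w, target, step=0.1, tol=1e-3, max_iters=500):
    """Gradient steps on a hidden state toward a target intensity.

    Minimizes (V(h) - target)^2 instead of maximizing V(h), so the state
    can move up or down and stops once it is within tol of the target.
    """
    h = list(h)
    for _ in range(max_iters):
        gap = predict(w, h) - target
        if abs(gap) < tol:
            break
        # Gradient of (w . h - target)^2 with respect to h is 2 * gap * w.
        h = [hi - step * 2.0 * gap * wi for hi, wi in zip(h, w)]
    return h


# Toy data: coordinate 0 tracks the attribute, coordinate 1 is unrelated.
trajectories = [
    ([[1.0, 0.0], [1.0, 1.0]], 1.0),
    ([[0.5, 1.0], [0.5, 0.0]], 0.5),
]
w = td0_train(trajectories, dim=2)
print("value head:", w)
print("V(partial state):", predict(w, [0.5, 1.0]))
up = steer_to_target([0.5, 0.0], w, target=0.8)
down = steer_to_target([1.0, 0.0], w, target=0.3)
print("steered up:", predict(w, up), "steered down:", predict(w, down))
```

Because the loss is the squared distance to the target, the same routine raises the predicted intensity in one call and lowers it in the other, whereas a maximization objective could only push upward. The sketch leaves out the language model itself; in the described method, the representations being edited are the model's hidden states during text generation.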