CL · AI · Jun 4, 2024

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

arXiv:2406.02721v3 · 5 citations
Originality: Highly original
AI Analysis

This paper addresses the problem of precise, adaptable control over LLM outputs for users who need safe and reliable AI interactions. It presents a novel method rather than an incremental improvement.

The paper tackles the problem of controlling large language model behaviors without human supervision by proposing SelfControl, an inference-time method that uses gradients from natural language suffix strings to guide generation, achieving improvements of 8.3% in detoxification, 3.1% in truthfulness, 4-10% in emotion tone control, and 48.2% in privacy protection.

We propose SelfControl, an inference-time model control method that uses gradients to control the behavior of large language models (LLMs) without explicit human annotations. Given a desired behavior expressed as a natural language suffix string concatenated to the input prompt, SelfControl computes gradients of the LLM's self-evaluation of the suffix with respect to its latent representations. The gradients are used to steer the auto-regressive generation process directly towards the desired behavior, which eliminates human supervision, achieves precise and transparent control, and offers on-the-fly adaptability. To further improve efficiency, we introduce SelfControl_{Prefix}, a compact module that encapsulates the representations learned from gradients into a prefix controller, enabling inference-time control with no added latency compared to the original model and allowing multiple behaviors to be controlled simultaneously. Our experiments demonstrate SelfControl's efficacy across multiple domains, where it improves over SOTA by 8.3% in detoxification, 3.1% in truthfulness enhancement, 4% to 10% in emotion tone control, and 48.2% in privacy protection, i.e., it completely removes the privacy leakage issue. Additionally, we demonstrate that SelfControl can be used for data synthesis and to improve reasoning abilities.
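The core idea in the abstract, nudging a latent representation along the gradient of a self-evaluation score, can be illustrated with a toy sketch. This is not the authors' implementation: the linear `readout` vector is an assumed stand-in for the LLM's self-evaluation of the suffix (for example, the log-probability of answering "Yes" to a question such as "Was the response non-toxic?"), and the names `suffix_score`, `suffix_gradient`, `steer`, `step_size`, and `n_steps` are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def suffix_score(hidden, readout):
    """Toy self-evaluation: log-probability of a "Yes" answer to the suffix.

    ASSUMPTION: a linear readout over the hidden state replaces the real LLM.
    """
    logit = sum(h * w for h, w in zip(hidden, readout))
    return math.log(sigmoid(logit))


def suffix_gradient(hidden, readout):
    """Analytic gradient of suffix_score with respect to the hidden state.

    d/dh log(sigmoid(w . h)) = (1 - sigmoid(w . h)) * w
    """
    logit = sum(h * w for h, w in zip(hidden, readout))
    scale = 1.0 - sigmoid(logit)
    return [scale * w for w in readout]


def steer(hidden, readout, step_size=0.5, n_steps=5):
    """Move the hidden state a few gradient-ascent steps toward a higher score."""
    h = list(hidden)
    for _ in range(n_steps):
        grad = suffix_gradient(h, readout)
        h = [hi + step_size * gi for hi, gi in zip(h, grad)]
    return h


# Hypothetical 3-dimensional hidden state and readout direction.
hidden = [0.2, -1.0, 0.5]
readout = [1.0, 0.5, -0.3]

before = suffix_score(hidden, readout)
steered = steer(hidden, readout)
after = suffix_score(steered, readout)
print(f"score before: {before:.4f}, after steering: {after:.4f}")
```

In a real model the gradient would come from backpropagation through the network rather than a closed-form expression, and the steered representations would feed back into auto-regressive decoding. The sketch only shows the direction of the update: each step raises the self-evaluation score for the desired behavior.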

Code Implementations: 1 repo
