CL | AI | LG | Jul 15, 2025

Internal Value Alignment in Large Language Models through Controlled Value Vector Activation

arXiv:2507.11316v1 | 9 citations | h-index: 10 | Has Code | ACL
Originality: Incremental advance
AI Analysis

The paper addresses the need for clarity, transparency, and adaptability in LLM alignment, which is crucial for safe and ethical AI deployment. It appears incremental, however, because it builds on existing value alignment techniques.

The paper tackles the problem of aligning large language models (LLMs) with human values by introducing a Controlled Value Vector Activation (ConVA) method that interprets and modifies internal activations, achieving the highest control success rate across 10 basic values without harming model performance or fluency.

Aligning Large Language Models (LLMs) with human values has attracted increasing attention since it provides clarity, transparency, and the ability to adapt to evolving scenarios. In this paper, we introduce a Controlled Value Vector Activation (ConVA) method that directly aligns the internal values of LLMs by interpreting how a value is encoded in their latent representations and modifying the relevant activations to ensure consistent values in LLMs. To ensure an accurate and unbiased interpretation, we propose a context-controlled value vector identification method. To consistently control values without sacrificing model performance, we introduce a gated value vector activation method that achieves effective value control with the minimum necessary degree of intervention. Experiments show that our method achieves the highest control success rate across 10 basic values without hurting LLM performance or fluency, and ensures target values even when given opposing and potentially malicious input prompts. Source code and data are available at https://github.com/hr-jin/ConVA.
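
The abstract describes two steps: finding a direction in activation space that encodes a value, then shifting activations along it only when needed and only as far as needed. The sketch below illustrates that general idea on toy vectors. It is not the authors' implementation: the difference-of-means direction is a common stand-in for the paper's context-controlled identification method, and the names `value_direction`, `gated_steer`, and `threshold` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def value_direction(pos: Sequence[Sequence[float]],
                    neg: Sequence[Sequence[float]]) -> Vector:
    """Unit-norm difference of means between activations that express a value
    (pos) and activations that do not (neg). Assumption: this simple estimator
    replaces the paper's context-controlled identification step."""
    diff = [p - q for p, q in zip(mean_vector(pos), mean_vector(neg))]
    norm = math.sqrt(dot(diff, diff))
    if norm == 0.0:
        raise ValueError("positive and negative means coincide")
    return [d / norm for d in diff]


def gated_steer(h: Sequence[float], v: Sequence[float],
                threshold: float) -> Vector:
    """Shift h along the unit vector v only if its projection onto v is below
    the threshold (the gate). The shift is the smallest one along v that
    brings the projection up to the threshold."""
    score = dot(h, v)
    if score >= threshold:
        return list(h)  # gate closed: activation already expresses the value
    eps = threshold - score  # minimal step size because v has unit norm
    return [x + eps * y for x, y in zip(h, v)]


# Toy 2-D activations (made-up numbers, for illustration only).
pos = [[2.0, 0.0], [4.0, 0.0]]
neg = [[0.0, 0.0], [0.0, 0.0]]
v = value_direction(pos, neg)

steered = gated_steer([0.5, 2.0], v, threshold=1.0)    # gate opens
untouched = gated_steer([1.5, 2.0], v, threshold=1.0)  # gate stays closed
```

In this toy setup the first activation is moved just far enough for its projection to reach the threshold, while the second is returned unchanged, which is the sense in which gating keeps the intervention minimal. In a real model the vectors would be hidden states at a chosen layer, and the gate would more plausibly be a trained classifier than a fixed projection threshold.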

