LG, AI | Apr 27, 2025

Adaptive Helpfulness-Harmlessness Alignment with Preference Vectors

arXiv:2504.20106v1 | 3 citations | h-index: 3
Originality: Highly original
AI Analysis

This paper addresses the problem of balancing safety and utility in LLMs, a concern for both developers and users. It presents a novel method rather than an incremental improvement.

The paper tackles the challenge of balancing helpfulness and harmlessness in large language models, where existing methods suffer from performance conflicts and limited controllability. The proposed Preference Vector framework enables fine-grained, user-controllable adjustments and scalable multi-preference alignment without retraining.

Ensuring that large language models (LLMs) are both helpful and harmless is a critical challenge, as overly strict constraints can lead to excessive refusals, while permissive models risk generating harmful content. Existing approaches, such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), attempt to balance these trade-offs but suffer from performance conflicts, limited controllability, and poor extendability. To address these issues, we propose Preference Vector, a novel framework inspired by task arithmetic. Instead of optimizing multiple preferences within a single objective, we train separate models on individual preferences, extract behavior shifts as preference vectors, and dynamically merge them at test time. This modular approach enables fine-grained, user-controllable preference adjustments and facilitates seamless integration of new preferences without retraining. Experiments show that our proposed Preference Vector framework improves helpfulness without excessive conservatism, allows smooth control over preference trade-offs, and supports scalable multi-preference alignment.
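The extract-and-merge step described in the abstract can be illustrated with a minimal task-arithmetic sketch. It assumes, purely for illustration, that a preference vector is the parameter difference between a preference-tuned model and the shared base model, and that merging adds a weighted sum of those vectors to the base; the paper's exact extraction procedure may differ. The names `extract_vector`, `merge`, and the toy parameter dictionaries are hypothetical and stand in for real model checkpoints.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def extract_vector(tuned: Params, base: Params) -> Params:
    """Behavior shift of a tuned model relative to the base (assumed definition)."""
    return {
        name: [t - b for t, b in zip(tuned[name], base[name])]
        for name in base
    }


def merge(base: Params, vectors: Dict[str, Params], weights: Dict[str, float]) -> Params:
    """Add each preference vector to the base, scaled by a user-chosen weight."""
    merged = {name: list(values) for name, values in base.items()}
    for pref, vec in vectors.items():
        w = weights.get(pref, 0.0)  # a preference without a weight is left out
        for name, delta in vec.items():
            merged[name] = [m + w * d for m, d in zip(merged[name], delta)]
    return merged


# Hypothetical two-parameter "models", one tuned per preference.
base = {"w": [0.0, 1.0]}
helpful = {"w": [0.5, 1.0]}
harmless = {"w": [0.0, 2.0]}

vectors = {
    "helpful": extract_vector(helpful, base),
    "harmless": extract_vector(harmless, base),
}

# Full helpfulness shift, half of the harmlessness shift.
merged = merge(base, vectors, {"helpful": 1.0, "harmless": 0.5})
print(merged)  # {'w': [0.5, 1.5]}
```

Because the weights are applied only at merge time, changing the trade-off or adding a new preference in this sketch means editing the `weights` and `vectors` dictionaries, with no retraining of the existing models.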
