AI · HC · Feb 12

Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment

arXiv:2602.12134v1
Originality: Incremental advance
AI Analysis

This paper addresses hidden value trade-offs in LLM alignment, a concern for both researchers and practitioners. The advance is incremental because it builds on existing value theory and alignment methods.

The paper asks how alignment interventions in LLMs affect interconnected values. It introduces the Value Alignment Tax (VAT) framework to quantify the resulting trade-offs and finds that alignment often produces uneven co-movement among values, which points to systemic risks.

Existing work on value alignment typically characterizes value relations statically, ignoring how interventions (such as prompting, fine-tuning, or preference optimization) reshape the broader value system. We introduce the Value Alignment Tax (VAT), a framework that measures how alignment-induced changes propagate across interconnected values relative to achieved on-target gain. VAT captures the dynamics of value expression under alignment pressure. Using a controlled scenario-action dataset grounded in Schwartz value theory, we collect paired pre-post normative judgments and analyze alignment effects across models, values, and alignment strategies. Our results show that alignment often produces uneven, structured co-movement among values. These effects are invisible under conventional target-only evaluation, revealing systemic, process-level alignment risks and offering new insights into the dynamics of value alignment in LLMs.
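
The abstract does not state the VAT formula, so the following is only an illustrative sketch of the general idea of "off-target change relative to on-target gain", not the paper's definition. The function name `value_shift_ratio`, the choice of summed absolute change, and the scores are all assumptions made up for this example; only the value names come from Schwartz value theory.

```python
from typing import Dict


def value_shift_ratio(pre: Dict[str, float], post: Dict[str, float], target: str) -> float:
    """Hypothetical ratio: total absolute off-target change per unit of on-target gain.

    This is an assumed stand-in for illustration, not the VAT metric from the paper.
    """
    gain = post[target] - pre[target]
    if gain <= 0:
        # Without a positive on-target gain the ratio has no sensible meaning.
        raise ValueError("no positive on-target gain; ratio is undefined")
    # Sum how far every non-target value moved, regardless of direction.
    off_target = sum(abs(post[v] - pre[v]) for v in pre if v != target)
    return off_target / gain


# Made-up pre/post scores for three Schwartz values; "benevolence" is the alignment target.
pre = {"benevolence": 0.50, "achievement": 0.50, "power": 0.50}
post = {"benevolence": 0.75, "achievement": 0.25, "power": 0.50}

ratio = value_shift_ratio(pre, post, "benevolence")
print(ratio)
```

A target-only evaluation would report just the rise in the target value here; the ratio additionally exposes that an untargeted value moved by the same amount.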

