LG, Feb 1

Tangent Space Fine-Tuning for Directional Preference Alignment in Large Language Models

arXiv:2602.01128v1, 1 citation

Originality: Incremental advance

AI Analysis

This work addresses the challenge of controllable alignment for LLMs, enabling more flexible and principled balancing of preferences like helpfulness and safety, though it builds incrementally on existing tangent-space fine-tuning methods.

The paper tackles the problem of balancing multiple human preference dimensions in large language models. It proposes Tangent-Space Direct Preference Optimization (TS-DPO), which learns per-objective update directions that can be linearly combined at inference to produce user-specified behaviors. On the helpfulness-verbosity trade-off, TS-DPO achieves broader Pareto-optimal coverage and smoother preference control than scalarized DPO.

Our goal is to enable large language models (LLMs) to balance multiple human preference dimensions, such as helpfulness, safety, and verbosity, through principled and controllable alignment. Existing preference optimization methods, including Direct Preference Optimization (DPO), collapse feedback into a single scalar reward, fixing one balance among objectives and preventing traversal of the Pareto front. Recent work by Ortiz-Jimenez et al. (2023) showed that fine-tuning can be viewed in a model's tangent space, where linearized updates act as additive vectors that can be composed to jointly perform well on multiple tasks. We extend this formulation to preference alignment and propose Tangent-Space Direct Preference Optimization (TS-DPO), which performs DPO within this locally linear regime to learn per-objective update directions. These directions can be linearly combined at inference to generate user-specified behaviors without additional optimization. Evaluated on the helpfulness-verbosity trade-off using the HelpSteer and UltraFeedback datasets, TS-DPO achieves broader Pareto-optimal coverage and smoother preference control than scalarized DPO. Canonical Correlation Analysis (CCA) further shows that tangent-space training amplifies canonical directions aligned with distinct preferences, improving disentanglement.
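To illustrate the idea of combining per-objective update directions at inference, here is a minimal sketch on a toy scalar model. It is not the authors' implementation: the model, the two direction vectors, the input, and the convex weighting `[alpha, 1 - alpha]` are all hypothetical stand-ins chosen for illustration. It only shows the general property that, for a model linearized around the pretrained weights, the output is linear in the mixing weights.

```python
import math

def model(theta, x):
    """Toy scalar 'network': tanh of a dot product (a stand-in for an LLM output)."""
    return math.tanh(sum(t * xi for t, xi in zip(theta, x)))

def jacobian(theta, x):
    """Gradient of the toy model output with respect to theta."""
    s = sum(t * xi for t, xi in zip(theta, x))
    scale = 1.0 - math.tanh(s) ** 2
    return [scale * xi for xi in x]

def linearized(theta0, tau, x):
    """First-order Taylor expansion around theta0: f(x; theta0) + J(x) . tau."""
    jac = jacobian(theta0, x)
    return model(theta0, x) + sum(j * t for j, t in zip(jac, tau))

def combine(directions, weights):
    """Weighted sum of per-objective update directions."""
    dim = len(directions[0])
    return [sum(w * d[k] for w, d in zip(weights, directions)) for k in range(dim)]

# All values below are made up for illustration.
theta0 = [0.2, -0.1, 0.4]           # "pretrained" weights
tau_helpful = [0.05, 0.00, -0.02]   # hypothetical helpfulness direction
tau_verbose = [-0.01, 0.03, 0.04]   # hypothetical verbosity direction
x = [1.0, 2.0, -1.0]                # toy input

# Sweep the mixing weight without any further optimization.
for alpha in (0.0, 0.5, 1.0):
    tau = combine([tau_helpful, tau_verbose], [alpha, 1.0 - alpha])
    print(alpha, linearized(theta0, tau, x))
```

Because the linearized model is affine in the update vector, sweeping `alpha` moves the output along a straight line between the two single-objective outputs. That property is what makes additive composition of directions well behaved in this toy setting.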
