LG AI | Feb 2

Semantic-aware Wasserstein Policy Regularization for Large Language Model Alignment

Byeonghu Na, Hyungho Na, Yeongmin Kim, Suhyeon Jo, HeeSun Bae, Mina Kang, Il-Chul Moon

arXiv:2602.01685v1 | 3.82 citations | h-index: 9 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the alignment of large language models with human preferences and represents an incremental improvement in regularization methods.

The paper addresses a limitation of KL divergence regularization in aligning large language models with human preferences: KL fails to capture semantic similarity between tokens. It proposes Wasserstein Policy Regularization (WPR), which incorporates the geometry of the token space, and reports improved performance over KL- and $f$-divergence-based baselines.

Large language models (LLMs) are commonly aligned with human preferences using reinforcement learning from human feedback (RLHF). In this method, LLM policies are generally optimized through reward maximization with Kullback-Leibler (KL) divergence regularization of the reference policy. However, KL and its $f$-divergence variants only compare token probabilities at identical indices, failing to capture semantic similarity. We propose Wasserstein Policy Regularization (WPR), a semantic-aware regularization for the RLHF framework based on the entropy-regularized Wasserstein distance, which incorporates the geometry of the token space. The dual formulation of the distance expresses the regularization as penalty terms applied to the reward via optimal dual variables, which yield a tractable objective compatible with standard RL algorithms. Empirically, our method outperforms KL- and $f$-divergence-based baselines, demonstrating the benefits of semantic-aware policy distances for alignment. Our code is available at https://github.com/aailab-kaist/WPR.
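The contrast the abstract draws between index-wise divergences and a geometry-aware distance can be illustrated on a toy vocabulary. The sketch below is not the authors' WPR implementation (see their repository for that). It is a minimal NumPy example assuming a hypothetical three-token vocabulary, made-up 2-D token embeddings, a Euclidean ground cost, and an arbitrary entropic regularization strength `eps`. It compares the KL divergence with the transport cost of an entropy-regularized optimal transport plan computed by plain Sinkhorn iterations.

```python
import numpy as np

def kl_divergence(q, p):
    """KL(q || p) for strictly positive discrete distributions."""
    return float(np.sum(q * np.log(q / p)))

def sinkhorn_cost(q, p, cost, eps=0.5, n_iters=500):
    """Transport cost <plan, cost> of the entropy-regularized OT plan
    between q (row marginal) and p (column marginal).
    eps and n_iters are illustrative choices, not values from the paper."""
    K = np.exp(-cost / eps)           # Gibbs kernel
    u = np.ones_like(q)
    v = np.ones_like(p)
    for _ in range(n_iters):          # alternating marginal scaling
        u = q / (K @ v)
        v = p / (K.T @ u)
    plan = u[:, None] * K * v[None, :]
    return float(np.sum(plan * cost))

# Hypothetical embeddings: "good" and "great" are close, "bad" is far away.
tokens = ["good", "great", "bad"]
emb = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
cost = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)

p_ref = np.array([0.8, 0.1, 0.1])     # reference policy favors "good"
q_close = np.array([0.1, 0.8, 0.1])   # mass moved to the synonym "great"
q_far = np.array([0.1, 0.1, 0.8])     # mass moved to the antonym "bad"

kl_close, kl_far = kl_divergence(q_close, p_ref), kl_divergence(q_far, p_ref)
w_close = sinkhorn_cost(q_close, p_ref, cost)
w_far = sinkhorn_cost(q_far, p_ref, cost)

print(f"KL:       close={kl_close:.4f}  far={kl_far:.4f}")
print(f"Sinkhorn: close={w_close:.4f}  far={w_far:.4f}")
```

In this toy setup the two KL values coincide, because KL only sees that the same amount of probability left the "good" index, while the transport cost is smaller for the shift toward the nearby token. The paper's method works with the dual formulation of the distance inside an RL objective, which this sketch does not attempt to reproduce.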
