CL AI LG | Sep 27, 2025

Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization

Junming Yang, Ning Xu, Biao Liu, Shiqi Qiao, Xin Geng

arXiv:2509.23371v16.72 citations | h-index: 3

Originality: Highly original

AI Analysis

This work addresses a key challenge in aligning LLMs with human preferences and offers a more efficient, adaptive approach that benefits AI safety and performance, though it is an incremental improvement over existing preference optimization frameworks.

The paper tackles the distribution mismatch between offline preference data and evolving model policies in aligning large language models, proposing MetaAPO to dynamically couple data generation with training, which outperforms existing methods on benchmarks like AlpacaEval 2 and reduces online annotation costs by 42%.

Preference optimization is crucial for aligning large language models (LLMs) with human values and intentions. A significant challenge in this process is the distribution mismatch between pre-collected offline preference data and the evolving model policy. Existing methods attempt to reduce this gap using static heuristics or decoupled online sampling strategies, but they often fail to adapt to the model's dynamic learning state. To bridge this gap, we propose Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework that dynamically couples data generation with model training. MetaAPO employs a lightweight meta-learner as an "alignment gap estimator" to evaluate the potential benefits of on-policy sampling relative to offline data. This estimate guides targeted online generation and assigns sample-wise meta-weights to the optimization objective, dynamically balancing the quality and distribution of online and offline data. Experiments on AlpacaEval 2, Arena-Hard and MT-Bench demonstrate that MetaAPO consistently outperforms existing preference optimization approaches across various settings, while reducing online annotation costs by 42%.
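
The abstract does not state the objective or the form of the meta-learner, so the following is only an illustrative sketch of how sample-wise meta-weights could gate online sampling and blend offline and online losses. Everything in it is an assumption rather than the authors' method: the DPO-style pairwise loss, the sigmoid meta-learner with hypothetical parameters `scale` and `bias`, the rule "sample online with probability 1 - weight", and all function names.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def pairwise_loss(margin: float, beta: float = 0.1) -> float:
    """Assumed DPO-style loss -log(sigmoid(beta * margin)) for one preference pair.

    `margin` stands for the policy-vs-reference log-ratio of the chosen
    response minus that of the rejected response.
    """
    x = beta * margin
    # Stable softplus(-x): avoids overflow for large negative margins.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def meta_weight(offline_margin: float, scale: float = 1.0, bias: float = 0.0) -> float:
    """Hypothetical meta-learner: maps the offline margin to a weight in (0, 1)."""
    return sigmoid(scale * offline_margin + bias)


def should_sample_online(weight: float, rng: random.Random) -> bool:
    """Assumed gating rule: request on-policy samples with probability 1 - weight."""
    return rng.random() >= weight


def meta_weighted_loss(offline_margin: float, online_margin: float,
                       weight: float, beta: float = 0.1) -> float:
    """Assumed objective: convex combination of offline and online pair losses."""
    return (weight * pairwise_loss(offline_margin, beta)
            + (1.0 - weight) * pairwise_loss(online_margin, beta))


if __name__ == "__main__":
    rng = random.Random(0)
    offline_margin, online_margin = -2.0, 1.5  # made-up values for illustration
    w = meta_weight(offline_margin)
    if should_sample_online(w, rng):
        loss = meta_weighted_loss(offline_margin, online_margin, w)
    else:
        loss = pairwise_loss(offline_margin)  # no online pair was generated
    print(f"weight={w:.3f} loss={loss:.3f}")
```

In this sketch a low weight both raises the chance of generating an online pair and shifts the loss toward it, which is one simple way to couple data generation with training; the paper's actual estimator and objective may differ.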
