CLApr 28, 2025

Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models

Zae Myung Kim, Chanwoo Park, Vipul Raheja, Suin Kim, Dongyeop Kang

DeepMind

arXiv:2504.20157v29.64 citationsh-index: 22Has Code

Originality Incremental advance

AI Analysis

This addresses practical challenges in aligning LLMs for researchers and practitioners, offering a more robust and adaptable method, though it is incremental as it builds on existing meta-learning and RL alignment frameworks.

The paper tackles reward hacking and brittle prompt engineering in reward-based alignment for large language models by introducing Meta Policy Optimization (MPO), which uses a meta-reward model to dynamically refine prompts during training, achieving performance on par with or better than extensively hand-crafted prompts across diverse tasks.

Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when LLMs are used as reward models. We introduce Meta Policy Optimization (MPO), a framework that addresses these challenges by integrating a meta-reward model that dynamically refines the reward model's prompt throughout training. In MPO, the meta-reward model monitors the evolving training context and continuously adjusts the reward model's prompt to maintain high alignment, providing an adaptive reward signal that resists exploitation by the policy. This meta-learning approach promotes a more stable policy optimization, and greatly reduces the need for manual reward prompt design. It yields performance on par with or better than models guided by extensively hand-crafted reward prompts. Furthermore, we show that MPO maintains its effectiveness across diverse tasks, from essay writing to mathematical reasoning, without requiring specialized reward designs. Beyond standard RLAIF, MPO's meta-learning formulation is readily extensible to higher-level alignment frameworks. Overall, this method addresses theoretical and practical challenges in reward-based RL alignment for LLMs, paving the way for more robust and adaptable alignment strategies. The code and data can be accessed at: https://github.com/minnesotanlp/mpo

View on arXiv PDF Code

Similar