CL, Oct 28, 2024

CARMO: Dynamic Criteria Generation for Context-Aware Reward Modelling

arXiv:2410.21545v2, 10 citations, h-index: 28
AI Analysis

This paper addresses flawed reward signals in RLHF for AI alignment and offers a novel method for improving model outputs, though the contribution is incremental relative to existing reward modeling approaches.

The paper tackles reward hacking in large language models by proposing CARMO, which generates dynamic, context-aware criteria for reward modeling. It reports a 2.1% improvement on Reward Bench and gains in alignment metrics, such as a 22.5% length-controlled win rate (LC-WR) on Mistral-Base.

Reward modeling in large language models is susceptible to reward hacking, causing models to latch onto superficial features such as the tendency to generate lists or unnecessarily long responses. In reinforcement learning from human feedback (RLHF), and more generally during post-training, flawed reward signals often lead to outputs that optimize for these spurious correlates instead of genuine quality or correctness. We propose Context-Aware Reward Modeling (CARMO), a novel approach that first generates dynamic, context-relevant criteria to ground the reward model before producing reward scores. Unlike prior methods that rely on static rubrics, CARMO leverages large language models (LLMs) to adaptively create evaluation criteria, such as logical consistency, clarity, and depth, tailored to the user query. Our theoretical analysis shows that such criteria generation can mitigate reward hacking. We further demonstrate that CARMO can be distilled into smaller models, reducing the computational cost of alignment. We establish a new state-of-the-art performance in zero-shot settings for generative models, achieving a 2.1% improvement on Reward Bench. Furthermore, alignment performed on the CARMO-curated preference dataset achieves 22.5% LC-WR and 21.1% WR on Mistral-Base (7B).
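The two-stage idea in the abstract (generate criteria for the query first, then score against them) can be illustrated with a minimal sketch. This is not the authors' implementation: the prompts, the 1 to 10 scale, the function names, and the `toy_llm` stand-in are all assumptions made for illustration, and a real setup would replace `toy_llm` with an actual model call.

```python
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a text reply.
LLM = Callable[[str], str]


def generate_criteria(llm: LLM, query: str) -> List[str]:
    """Stage 1: ask the model for evaluation criteria tailored to the query."""
    prompt = (
        "List evaluation criteria, one per line, for judging answers to:\n"
        f"{query}"
    )
    lines = llm(prompt).splitlines()
    # Drop bullet markers and blank lines from the reply.
    return [ln.strip("-* \t") for ln in lines if ln.strip("-* \t")]


def score_response(llm: LLM, query: str, response: str, criteria: List[str]) -> float:
    """Stage 2: score a response with the generated criteria in the prompt."""
    rubric = "\n".join(f"- {c}" for c in criteria)
    prompt = (
        f"Query:\n{query}\n\nResponse:\n{response}\n\n"
        f"Criteria:\n{rubric}\n\nGive a single score from 1 to 10."
    )
    # Assumes the model replies with a bare number (an illustrative simplification).
    return float(llm(prompt).strip())


def prefer(llm: LLM, query: str, a: str, b: str) -> str:
    """Label the preferred response; the same criteria are reused for both."""
    criteria = generate_criteria(llm, query)
    score_a = score_response(llm, query, a, criteria)
    score_b = score_response(llm, query, b, criteria)
    return "a" if score_a >= score_b else "b"


def toy_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, for demonstration only."""
    if prompt.startswith("List evaluation criteria"):
        return "- logical consistency\n- clarity\n- depth"
    # Toy heuristic: reward responses that offer an explanation.
    return "8" if "because" in prompt else "3"


if __name__ == "__main__":
    q = "Why is the sky blue?"
    print(generate_criteria(toy_llm, q))
    print(prefer(toy_llm, q, "It is blue because of Rayleigh scattering.", "It just is."))
```

Labels produced by a function like `prefer` over many query and response pairs would form a preference dataset of the kind the abstract describes using for alignment, though the paper's actual curation procedure may differ.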
