CL, AI, LG · Mar 23, 2025

Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment

arXiv:2503.18991v5 · 21 citations · h-index: 12
Originality: Incremental advance
AI Analysis

This paper addresses safety alignment for LLM deployment, but it appears incremental because it builds on existing reward-based pipelines and optimization methods.

The paper tackles safety alignment of large language models (LLMs), targeting two factors that limit optimization efficiency: imbalanced safety datasets and static reward models. It proposes DR-IRL, which combines inverse reinforcement learning with dynamic reward scaling and is reported to outperform all baseline methods in safety alignment while maintaining usefulness.

Alignment is vital for safely deploying large language models (LLMs). Existing techniques are either reward-based (train a reward model on preference pairs and optimize with reinforcement learning) or reward-free (directly fine-tune on ranked outputs). Recent research shows that well-tuned reward-based pipelines remain robust, and single-response demonstrations can outperform pairwise preference data. However, two challenges persist: (1) imbalanced safety datasets that overrepresent common hazards while neglecting long-tail threats; and (2) static reward models that ignore task difficulty, limiting optimization efficiency and attainable gains. We propose DR-IRL (Dynamically adjusting Rewards through Inverse Reinforcement Learning). We first use IRL to train category-specific reward models on a balanced safety dataset covering seven harmful categories. We then enhance Group Relative Policy Optimization (GRPO) with dynamic reward scaling, which adjusts rewards by task difficulty: data-level hardness is measured by text-encoder cosine similarity, and model-level responsiveness is measured by reward gaps. Extensive experiments across various benchmarks and LLMs demonstrate that DR-IRL outperforms all baseline methods in safety alignment while maintaining usefulness.
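To make the dynamic reward scaling idea more concrete, here is a minimal Python sketch. The abstract does not give the paper's formulas, so everything specific is an assumption for illustration: the function names, the use of clamped cosine similarity between a safe and an unsafe response embedding as data-level hardness, the sigmoid of the negative reward gap as model-level hardness, and the choice to multiply the two into a single coefficient. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO advantage.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def data_hardness(safe_emb, unsafe_emb):
    """Hypothetical data-level hardness in [0, 1].

    Assumption: a safe and an unsafe response whose text embeddings are
    close together are harder to tell apart, so similarity ~ hardness.
    """
    return max(0.0, cosine_similarity(safe_emb, unsafe_emb))


def model_hardness(reward_safe, reward_unsafe):
    """Hypothetical model-level hardness in (0, 1).

    Assumption: a small (or negative) reward gap means the reward model
    barely separates the pair, so hardness = sigmoid(-gap).
    """
    gap = reward_safe - reward_unsafe
    return 1.0 / (1.0 + math.exp(gap))


def scaled_group_advantages(rewards, hardness, eps=1e-8):
    """GRPO-style group-relative advantages, scaled by a hardness coefficient.

    The normalization is the usual GRPO one; multiplying by `hardness`
    is this sketch's stand-in for dynamic reward scaling.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [hardness * (r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Toy numbers, not taken from the paper.
    h = data_hardness([0.6, 0.8], [0.8, 0.6]) * model_hardness(1.2, 0.9)
    print(scaled_group_advantages([0.2, 0.5, 0.9, 0.4], h))
```

Under these assumptions, harder prompts (similar safe and unsafe responses, small reward gap) receive larger-magnitude advantages and so contribute more to the policy update, while easy prompts are down-weighted.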

