AI · Dec 23, 2025

Safety Alignment of LMs via Non-cooperative Games

arXiv:2512.20806v1 · 2 citations · h-index: 13
Originality: Highly original
AI Analysis

This paper addresses the critical problem of safety alignment for language models. It offers a new training paradigm that could improve both robustness and utility, though its game-theoretic framing appears incremental.

The paper tackles the challenge of aligning language models for safety without sacrificing utility. It frames alignment as a non-zero-sum game between an Attacker LM and a Defender LM, trained jointly via online reinforcement learning with preference-based rewards. The result is a Defender LM that is more helpful and more resilient to attacks, and an Attacker LM that serves as a strong red-teaming agent.

Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly via online reinforcement learning. Each LM continuously adapts to the other's evolving strategies, driving iterative improvement. Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking. Our RL recipe, AdvGame, shifts the Pareto frontier of safety and utility, yielding a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks. In addition, the resulting Attacker LM converges into a strong, general-purpose red-teaming agent that can be directly deployed to probe arbitrary target models.
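To make the contrast between pairwise and point-wise supervision concrete, here is a minimal sketch of a preference-based reward, assuming a judge that returns the probability that one response is preferred over another. It is not the AdvGame implementation: the names `pairwise_rewards` and `toy_judge` are hypothetical, and the toy judge (which simply prefers shorter text) stands in for whatever preference model the authors use.

```python
from itertools import permutations
from typing import Callable, List, Sequence

# Hypothetical judge signature: prefer(a, b) returns the probability
# (between 0 and 1) that response `a` is preferred over response `b`.
Judge = Callable[[str, str], float]


def pairwise_rewards(responses: Sequence[str], prefer: Judge) -> List[float]:
    """Score each response by its average win probability against the others.

    No absolute (point-wise) score is ever requested from the judge;
    every reward is derived from head-to-head comparisons only.
    """
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two responses to compare")
    totals = [0.0] * n
    # Visit every ordered pair (i, j) with i != j.
    for i, j in permutations(range(n), 2):
        totals[i] += prefer(responses[i], responses[j])
    # Each response was compared against n - 1 opponents.
    return [total / (n - 1) for total in totals]


def toy_judge(a: str, b: str) -> float:
    """Illustrative stand-in for a preference model: prefers shorter text."""
    if len(a) == len(b):
        return 0.5
    return 1.0 if len(a) < len(b) else 0.0


if __name__ == "__main__":
    candidates = ["aaa", "a", "aa"]
    print(pairwise_rewards(candidates, toy_judge))
```

In an online RL loop, rewards of this kind could be computed over a group of sampled Defender (or Attacker) responses per prompt. How the paper aggregates comparisons and defines each player's objective is not specified in the abstract, so the averaging used here is an assumption.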
