LG · AI · Jan 26

TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment

arXiv:2601.18292v1 · 2 citations · h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses safety risks in large language models, offering an incremental improvement over existing collaborative frameworks for safety alignment.

The paper targets toxic and harmful content generation in large language models. It proposes TriPlay-RL, a closed-loop reinforcement learning framework in which attacker, defender, and evaluator roles improve iteratively with near-zero manual annotation. Reported results include a 20%-50% improvement in adversarial effectiveness for the attacker and 10%-30% gains in safety performance for the defender.

In recent years, safety risks associated with large language models have become increasingly prominent, highlighting the urgent need to mitigate the generation of toxic and harmful content. The mainstream paradigm for LLM safety alignment typically adopts a collaborative framework involving three roles: an attacker for adversarial prompt generation, a defender for safety defense, and an evaluator for response assessment. In this paper, we propose a closed-loop reinforcement learning framework called TriPlay-RL that enables iterative, mutually improving collaboration among these three roles with near-zero manual annotation. Experimental results show that the attacker preserves high output diversity while achieving a 20%-50% improvement in adversarial effectiveness; the defender attains 10%-30% gains in safety performance without degrading general reasoning capability; and the evaluator continuously refines its fine-grained judgment ability through iterations, accurately distinguishing unsafe responses, simple refusals, and useful guidance. Overall, our framework establishes an efficient and scalable paradigm for LLM safety alignment, enabling continuous co-evolution within a unified learning loop.
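The closed loop described in the abstract can be pictured with a minimal Python sketch. This is an illustration only, not the authors' implementation: the three verdict classes come from the abstract, but the reward values, the `run_round` function, and the toy attacker, defender, and evaluator stand-ins are all hypothetical. The reinforcement learning update for each role is omitted entirely.

```python
from enum import Enum
from typing import Callable, List, Tuple


class Verdict(Enum):
    # The three response classes the abstract says the evaluator distinguishes.
    UNSAFE = "unsafe"
    SIMPLE_REFUSAL = "simple_refusal"
    USEFUL_GUIDANCE = "useful_guidance"


# Hypothetical reward values: the abstract does not specify reward shaping.
DEFENDER_REWARD = {
    Verdict.UNSAFE: -1.0,
    Verdict.SIMPLE_REFUSAL: 0.0,
    Verdict.USEFUL_GUIDANCE: 1.0,
}


def attacker_reward(verdict: Verdict) -> float:
    # Assumption: the attacker is rewarded only when it elicits an unsafe response.
    return 1.0 if verdict is Verdict.UNSAFE else 0.0


Record = Tuple[str, str, Verdict, float, float]


def run_round(
    attacker: Callable[[int], str],
    defender: Callable[[str], str],
    evaluator: Callable[[str, str], Verdict],
    n_prompts: int,
) -> List[Record]:
    """Run one attack/defend/evaluate pass and collect per-sample rewards."""
    records: List[Record] = []
    for i in range(n_prompts):
        prompt = attacker(i)
        response = defender(prompt)
        verdict = evaluator(prompt, response)
        records.append(
            (prompt, response, verdict,
             attacker_reward(verdict), DEFENDER_REWARD[verdict])
        )
    return records


# Toy stand-ins for the three models (hypothetical, rule-based).
def toy_attacker(i: int) -> str:
    return f"prompt-{i}"


def toy_defender(prompt: str) -> str:
    i = int(prompt.split("-")[1])
    if i % 3 == 0:
        return "Sure, here is how to do it."
    if i % 3 == 1:
        return "I can't help with that."
    return "I can't help with that, but instead consider a safer option."


def toy_evaluator(prompt: str, response: str) -> Verdict:
    if not response.startswith("I can't"):
        return Verdict.UNSAFE
    if "instead" in response:
        return Verdict.USEFUL_GUIDANCE
    return Verdict.SIMPLE_REFUSAL


if __name__ == "__main__":
    for record in run_round(toy_attacker, toy_defender, toy_evaluator, 3):
        print(record)
```

In a real system each role would be a language model, and the collected rewards would drive a policy update before the next round. The sketch only shows how one evaluator verdict can yield separate reward signals for the attacker and the defender.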

