CL, Feb 24

Overton Pluralistic Reinforcement Learning for Large Language Models

arXiv:2602.20759v11.11 citations, h-index: 1

Originality: Incremental advance

AI Analysis

The paper addresses the problem of capturing pluralistic human values in AI alignment, which matters for making language models more representative and useful. The contribution appears incremental, however, because it builds on existing RL and pluralism concepts.

The paper tackles the limitation of existing alignment paradigms in capturing diverse human values by introducing OP-GRPO, a reinforcement learning framework that enables a single large language model to generate pluralistic responses without explicit prompting. The trained Qwen2.5-3B-Instruct model achieved a 37.4% relative accuracy gain over a 20B GPT-OSS baseline on a Natural Language Inference benchmark.

Existing alignment paradigms remain limited in capturing the pluralistic nature of human values. Overton Pluralism addresses this gap by generating responses with diverse perspectives from a single query. This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single large language model to produce pluralistic responses without explicit prompting or modular orchestration. Our workflow consists of two main steps. First, similarity estimator training fine-tunes a Sentence Transformer for Overton Pluralism tasks to provide more accurate coverage evaluation of generated responses. Second, OP-GRPO training incorporates this similarity estimator into a dual-reward system designed to ensure both broad coverage of genuine human perspectives and the uniqueness of each perspective, thereby promoting diversity. Empirical results demonstrate a "small models, big perspective coverage" effect. The trained Qwen2.5-3B-Instruct model surpasses a 20B GPT-OSS baseline with a 37.4 percent relative accuracy gain on a Natural Language Inference benchmark, and also outperforms a modular architecture baseline with a 19.1 percent relative improvement. Additional evaluations using GPT-4.1 as a large language model judge further confirm the robustness of the approach.
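The dual-reward idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the reward formulas, so the threshold-based coverage and uniqueness scores, the equal weights, and all function names below are assumptions chosen for illustration. Toy vectors stand in for the embeddings that the fine-tuned Sentence Transformer would produce, and the advantage step uses the usual GRPO normalization (reward minus group mean, divided by group standard deviation).

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def coverage_reward(generated, reference, threshold=0.7):
    """Fraction of reference perspectives matched by some generated one.

    The threshold is a hypothetical value, not taken from the paper.
    """
    if not reference:
        return 0.0
    covered = sum(
        1 for r in reference
        if any(cosine(g, r) >= threshold for g in generated)
    )
    return covered / len(reference)


def uniqueness_reward(generated, threshold=0.7):
    """Fraction of generated perspectives that do not repeat an earlier one."""
    if not generated:
        return 0.0
    unique = sum(
        1 for i, g in enumerate(generated)
        if all(cosine(g, generated[j]) < threshold for j in range(i))
    )
    return unique / len(generated)


def dual_reward(generated, reference, w_cov=0.5, w_uniq=0.5):
    """Weighted sum of coverage and uniqueness (weights are assumptions)."""
    return (w_cov * coverage_reward(generated, reference)
            + w_uniq * uniqueness_reward(generated))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three reference perspectives and three sampled responses for one query.
reference = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
responses = [
    [[1, 0, 0], [0, 1, 0]],             # two distinct perspectives
    [[1, 0, 0], [1, 0, 0]],             # one perspective, repeated
    [[1, 0, 0], [0, 1, 0], [0, 0, 1]],  # all three perspectives
]
rewards = [dual_reward(resp, reference) for resp in responses]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

In this sketch the response that repeats a single perspective is penalized twice, once for low coverage and once for low uniqueness, so it receives the lowest advantage in its group, while the response covering all three reference perspectives receives the highest.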
