AI · Apr 19

Poly-EPO: Training Exploratory Reasoning Models

arXiv:2604.17654 · 97.1 · h-index: 67
Predicted impact: top 7% in AI · last 90 days
Originality: Incremental advance
AI Analysis

For researchers in language model reasoning, Poly-EPO offers a method to enhance exploration-exploitation synergy, but the gains are incremental over existing RL-based post-training approaches.

Poly-EPO trains language models to generate diverse, exploratory reasoning strategies via set reinforcement learning, improving generalization and test-time compute scaling across reasoning benchmarks.
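
Test-time compute scaling of this kind is commonly reported as pass@k, the probability that at least one of k sampled responses is correct. The sketch below uses the standard unbiased combinatorial estimator computed from n samples of which c are correct; the function name `pass_at_k` is illustrative, and nothing here is taken from the paper's evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    uniformly chosen size-k subset of the n samples contains no correct one.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Estimates for a prompt with 3 correct answers out of 16 samples.
    for k in (1, 4, 16):
        print(k, pass_at_k(n=16, c=3, k=k))
```

A model that keeps diverse strategies alive tends to show pass@k still rising at large k, which is the sense in which coverage reflects exploration.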

Exploration is a cornerstone of learning from experience: it enables agents to find solutions to complex problems, generalize to novel ones, and scale performance with test-time compute. In this paper, we present a framework for post-training language models (LMs) that explicitly encourages optimistic exploration and promotes a synergy between exploration and exploitation. The central idea is to train the LM to generate sets of responses that are collectively accurate under the reward function and exploratory in their reasoning strategies. We first develop a general recipe for optimizing LMs with set reinforcement learning (set RL) under arbitrary objective functions, showing how standard RL algorithms can be adapted to this setting through a modification to the advantage computation. We then propose Polychromic Exploratory Policy Optimization (Poly-EPO), which instantiates this framework with an objective that explicitly synergizes exploration and exploitation. Across a range of reasoning benchmarks, we show that Poly-EPO improves generalization, as evidenced by higher pass@$k$ coverage, preserves greater diversity in model generations, and effectively scales with test-time compute.
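
The idea of scoring sets of responses and changing only the advantage computation can be pictured with a toy sketch. Everything in it is an assumption made for illustration rather than the paper's definition: the set objective (mean reward multiplied by the fraction of distinct strategy labels), the use of string labels as a stand-in for reasoning strategies, and the names `set_score` and `set_advantages`. Each response's advantage is the average score of the size-k sets that contain it, minus the average score over all sets.

```python
from itertools import combinations
from statistics import mean


def set_score(rewards, strategies):
    """Hypothetical set objective: mean reward times strategy diversity.

    Diversity is the fraction of distinct strategy labels in the set.
    This is an illustrative choice, not the objective from the paper.
    """
    diversity = len(set(strategies)) / len(strategies)
    return mean(rewards) * diversity


def set_advantages(rewards, strategies, k):
    """Per-response advantages derived from set-level scores.

    Enumerates every size-k subset of the sampled responses, scores each
    subset, and credits a response with the mean score of the subsets it
    belongs to, minus the mean score over all subsets (the baseline).
    """
    n = len(rewards)
    if len(strategies) != n:
        raise ValueError("rewards and strategies must have equal length")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")

    subsets = list(combinations(range(n), k))
    scores = [
        set_score([rewards[i] for i in s], [strategies[i] for i in s])
        for s in subsets
    ]
    baseline = mean(scores)

    advantages = []
    for i in range(n):
        own_scores = [sc for s, sc in zip(subsets, scores) if i in s]
        advantages.append(mean(own_scores) - baseline)
    return advantages


if __name__ == "__main__":
    # Three correct responses (two share strategy "a") and one incorrect.
    rewards = [1.0, 1.0, 1.0, 0.0]
    strategies = ["a", "a", "b", "b"]
    print(set_advantages(rewards, strategies, k=2))
```

In this toy case the correct response with the less common strategy receives the largest advantage, the two correct responses sharing a strategy receive a smaller positive one, and the incorrect response is penalized. These values could replace the per-response advantages in a standard policy-gradient update; exhaustive enumeration of subsets is only practical for small n and k.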
