AI · Apr 20

Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play

arXiv:2604.17696 · 99.1 · h-index: 11
Predicted impact: top 1% in AI · last 90 days · Originality: Highly original
AI Analysis

For researchers in language model reasoning, this work addresses the problem of learning domain-agnostic reasoning patterns rather than game-specific heuristics.

STRATAGEM introduces a self-play framework that uses trajectory-modulated rewards to learn transferable reasoning in language models, achieving substantial improvements on competition-level mathematics and other reasoning benchmarks.

Games offer a compelling paradigm for developing general reasoning capabilities in language models, as they naturally demand strategic planning, probabilistic inference, and adaptive decision-making. However, existing self-play approaches rely solely on terminal game outcomes, providing no mechanism to distinguish transferable reasoning patterns from game-specific heuristics. We present STRATAGEM, which addresses two fundamental barriers to reasoning transfer: domain specificity, where learned patterns remain anchored in game semantics, and contextual stasis, where static game contexts fail to cultivate progressive reasoning. STRATAGEM selectively reinforces trajectories exhibiting abstract, domain-agnostic reasoning through a Reasoning Transferability Coefficient, while incentivizing adaptive reasoning development via a Reasoning Evolution Reward. Experiments across mathematical reasoning, general reasoning, and code generation benchmarks demonstrate substantial improvements, with particularly strong gains on competition-level mathematics where multi-step reasoning is critical. Ablation studies and human evaluation confirm that both components contribute to transferable reasoning.
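The abstract does not give the exact form of the trajectory-modulated reward, so the following is only an illustrative sketch of how a terminal game outcome might be combined with the two components it names. It assumes the Reasoning Transferability Coefficient acts as a multiplicative weight in [0, 1] on the outcome and that the Reasoning Evolution Reward enters as an additive bonus. The names `Trajectory`, `modulated_reward`, and `evolution_weight`, as well as the combination rule itself, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One self-play episode, reduced to three scalar signals (all hypothetical)."""
    outcome: float          # terminal game outcome, e.g. +1.0 for a win, -1.0 for a loss
    transferability: float  # assumed coefficient in [0, 1]; higher = more domain-agnostic reasoning
    evolution: float        # assumed score in [0, 1]; higher = reasoning adapts as the game progresses


def modulated_reward(traj: Trajectory, evolution_weight: float = 0.5) -> float:
    """Combine the signals into one scalar reward.

    Assumption (not from the paper): the coefficient scales the terminal
    outcome, and the evolution score is added with a fixed weight.
    """
    if not 0.0 <= traj.transferability <= 1.0:
        raise ValueError("transferability must lie in [0, 1]")
    return traj.transferability * traj.outcome + evolution_weight * traj.evolution


# Two winning trajectories with the same outcome but different reasoning styles.
abstract_win = Trajectory(outcome=1.0, transferability=1.0, evolution=0.5)
heuristic_win = Trajectory(outcome=1.0, transferability=0.25, evolution=0.5)

print(modulated_reward(abstract_win))
print(modulated_reward(heuristic_win))
```

Under these assumptions, two wins with identical terminal outcomes receive different rewards, which is the distinction the abstract says outcome-only self-play cannot make.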

