LG · AI · May 13

Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

arXiv:2605.13554 · 15.9
Predicted impact: top 41% in LG (last 90 days) · Originality: highly original
AI Analysis

This work bridges contrastive RL with on-policy training, making self-supervised RL viable for discrete and multi-agent settings where off-policy methods struggle.

CPPO is presented as the first on-policy contrastive RL algorithm, enabling self-supervised learning without a reward function or a replay buffer. It outperforms prior CRL baselines on 14 of 18 tasks and matches or exceeds PPO trained with hand-crafted dense rewards on 12 of 18 tasks.

Contrastive reinforcement learning (CRL) learns goal-conditioned Q-values through a contrastive objective over state-action and goal representations, removing the need for hand-crafted reward functions. Although CRL has made self-supervised RL viable, all existing CRL algorithms rely on off-policy optimisation and are mostly constrained to continuous action spaces, with little research on discrete environments. This leaves CRL disconnected from the widely used and effective on-policy training pipelines adopted across single-agent and multi-agent RL in both continuous and discrete environments. To establish a first connection, we introduce Contrastive Proximal Policy Optimisation (CPPO), an on-policy contrastive RL algorithm that derives policy advantages directly from contrastive Q-values and optimises them via the standard PPO objective, without requiring a reward function or a replay buffer. We evaluate CPPO across continuous and discrete, single-agent and cooperative multi-agent tasks. Beyond the inherent usefulness of an on-policy approach, we observe that CPPO not only significantly outperforms the previous CRL baselines in 14 out of 18 tasks, but also matches or exceeds the performance of PPO trained with hand-crafted dense rewards in 12 out of the 18 tasks tested.
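The pipeline described in the abstract (a contrastive critic, advantages derived from its Q-values, and a PPO update) can be illustrated with a minimal pure-Python sketch. The abstract does not say how CPPO turns contrastive Q-values into advantages, so the expected-Q baseline used here is an assumption, as are the InfoNCE form of the critic loss and all function names. It is not the authors' implementation.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def infonce_loss(sa_reprs, goal_reprs):
    """InfoNCE-style critic loss (assumed form of the contrastive objective).

    Row i of sa_reprs and row i of goal_reprs form a positive pair;
    every other goal in the batch serves as a negative.
    The critic score is the inner product of the two representations.
    """
    total = 0.0
    for i, sa in enumerate(sa_reprs):
        logits = [sum(x * y for x, y in zip(sa, g)) for g in goal_reprs]
        total -= log_softmax(logits)[i]
    return total / len(sa_reprs)


def contrastive_advantage(q_row, action_probs, action):
    """Advantage of one discrete action (assumed baseline, not from the paper).

    q_row holds the critic score for every action in one state-goal pair;
    the baseline is the expected score under the current policy.
    """
    baseline = sum(p * q for p, q in zip(action_probs, q_row))
    return q_row[action] - baseline


def ppo_clip_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate, returned as a loss to minimise."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)
```

In this sketch no reward signal appears anywhere: the only learning signal for the policy is the critic score, and the advantages are computed from freshly collected on-policy data, which is why no replay buffer is needed.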
