LG AIMay 13

Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

Asim Osman, Sasha Abramowitz, Mark Bergh, Ulrich Armel Mbou Sob, Ruan John de Kock, Omayma Mahjoub, Oussama Hidaoui, Noah De Nicola, Arnol Manuel Fokam, Felix Chalumeau, Daniel Rajaonarivonivelomanantsoa, Siddarth Singh

arXiv:2605.1355415.9

Predicted impact top 41% in LG · last 90 daysOriginality Highly original

AI Analysis

This work bridges contrastive RL with on-policy training, making self-supervised RL viable for discrete and multi-agent settings where off-policy methods struggle.

CPPO introduces the first on-policy contrastive RL algorithm, enabling self-supervised learning without reward functions or replay buffers. It outperforms prior CRL methods in 14/18 tasks and matches/exceeds PPO with hand-crafted rewards in 12/18 tasks.

Contrastive reinforcement learning (CRL) learns goal-conditioned Q-values through a contrastive objective over state-action and goal representations, removing the need for hand-crafted reward functions. Despite impressive success in achieving viable self-supervised learning in RL, all existing CRL algorithms rely on off-policy optimisation and are mostly constrained to continuous action spaces, with little research invested in discrete environments. This leaves CRL disconnected from widely used and effective, modern on-policy training pipelines adopted across both single-agent and multi-agent RL in continuous and discrete environments. To establish a first connection, we introduce Contrastive Proximal Policy Optimisation (CPPO). CPPO is an on-policy contrastive RL algorithm that derives policy advantages directly from contrastive Q-values and optimises them via the standard PPO objective, without requiring a reward function or a replay buffer. We evaluate CPPO across continuous and discrete, single-agent and cooperative multi-agent tasks. Whilst the existence of an on-policy approach is inherently useful, we observe that \textbf{CPPO not only significantly outperforms the previous CRL baselines in 14 out of 18 tasks, but also matches or exceeds PPO's performance, which uses hand-crafted dense rewards, in 12 out of the 18 tasks tested.}

View on arXiv PDF

Similar