LG · AI · Apr 8

Regret-Aware Policy Optimization: Environment-Level Memory for Replay Suppression under Delayed Harm

arXiv:2604.07428 · 2.8 · h-index: 2

Predicted impact: top 97% in LG · last 90 days
Originality: Incremental advance

AI Analysis

The paper addresses safety in reinforcement learning for platform-mediated systems. It is rated incremental because it builds on existing methods, with a novel adaptation.

The paper tackles delayed harm in reinforcement learning, where harmful effects recur after a washout period. It introduces Regret-Aware Policy Optimization (RAPO) to suppress this replay, reducing re-amplification gain from 0.98 to 0.33 on 250-node graphs while retaining 82% of task return.

Safety in reinforcement learning (RL) is typically enforced through objective shaping while keeping environment dynamics stationary with respect to observable state-action pairs. Under delayed harm, this can lead to replay: after a washout period, reintroducing the same stimulus under matched observable conditions reproduces a similar harmful cascade. We introduce the Replay Suppression Diagnostic (RSD), a controlled exposure-decay-replay protocol that isolates this failure mode under frozen-policy evaluation. We show that, under stationary observable transition kernels, replay cannot be structurally suppressed without inducing a persistent shift in replay-time action distributions. Motivated by platform-mediated systems, we propose Regret-Aware Policy Optimization (RAPO), which augments the environment with persistent harm-trace and scar fields and applies a bounded, mass-preserving transition reweighting to reduce reachability of historically harmful regions. On graph diffusion tasks (50-1000 nodes), RAPO suppresses replay, reducing re-amplification gain (RAG) from 0.98 to 0.33 on 250-node graphs while retaining 82% of task return. Disabling transition deformation only during replay restores re-amplification (RAG 0.91), isolating environment-level deformation as the causal mechanism.
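To make "bounded, mass-preserving transition reweighting" concrete, the following is a minimal sketch and not the paper's actual update rule. It assumes a per-node scar value in [0, 1], an exponential down-weighting with a hypothetical strength `beta`, and a hypothetical floor `min_weight` that keeps the deformation bounded. The function name and all parameters are illustrative assumptions; the abstract does not specify them.

```python
import math


def reweight_row(probs, scar, beta=2.0, min_weight=0.1):
    """Down-weight transitions into scarred nodes, then renormalize.

    probs: transition probabilities from one node to its successors.
    scar:  assumed scar value in [0, 1] for each successor (hypothetical).
    beta, min_weight: illustrative knobs, not taken from the paper.
    """
    if len(probs) != len(scar):
        raise ValueError("probs and scar must have the same length")
    # Bounded: every weight lies in [min_weight, 1], so no transition
    # is ever removed outright.
    weights = [max(min_weight, math.exp(-beta * s)) for s in scar]
    unnormalized = [p * w for p, w in zip(probs, weights)]
    total = sum(unnormalized)
    if total == 0.0:
        # Degenerate row: leave it unchanged.
        return list(probs)
    # Mass-preserving: the reweighted row still sums to 1.
    return [u / total for u in unnormalized]


row = [0.5, 0.3, 0.2]
scar = [1.0, 0.0, 0.0]  # only the first successor is historically harmful
new_row = reweight_row(row, scar)
print(new_row)
```

In this toy row, probability shifts away from the scarred successor and toward the two unscarred ones, whose relative odds are unchanged. Because the scar values persist in the environment rather than in the policy, the same frozen policy would meet different transition probabilities at replay time, which is the kind of environment-level deformation the abstract describes.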

