LGAISep 4, 2025

What Fundamental Structure in Reward Functions Enables Efficient Sparse-Reward Learning?

arXiv:2509.03790v21 citationsh-index: 4
Originality Highly original
AI Analysis

This work addresses the problem of inefficient learning in sparse-reward RL for AI and robotics applications, presenting an incremental advancement with a novel method for a known bottleneck.

The paper tackles the challenge of sparse-reward reinforcement learning by introducing Policy-Aware Matrix Completion (PAMC), which exploits low-rank and sparse structure in reward matrices under policy-biased sampling, and demonstrates improved sample efficiency across multiple benchmarks, such as Atari-26 with 10M steps, outperforming methods like DrQ-v2 and DreamerV3.

Sparse-reward reinforcement learning (RL) remains fundamentally hard: without structure, any agent needs $Ω(|\mathcal{S}||\mathcal{A}|/p)$ samples to recover rewards. We introduce Policy-Aware Matrix Completion (PAMC) as a first concrete step toward a structural reward learning framework. Our key idea is to exploit approximate low-rank + sparse structure in the reward matrix, under policy-biased (MNAR) sampling. We prove recovery guarantees with inverse-propensity weighting, and establish a visitation-weighted error-to-regret bound linking completion error to control performance. Importantly, when assumptions weaken, PAMC degrades gracefully: confidence intervals widen and the algorithm abstains, ensuring safe fallback to exploration. Empirically, PAMC improves sample efficiency across Atari-26 (10M steps), DM Control, MetaWorld MT50, D4RL offline RL, and preference-based RL benchmarks, outperforming DrQ-v2, DreamerV3, Agent57, T-REX/D-REX, and PrefPPO under compute-normalized comparisons. Our results highlight PAMC as a practical and principled tool when structural rewards exist, and as a concrete first instantiation of a broader structural reward learning perspective.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes