LG | Apr 15, 2025

Reward Distance Comparisons Under Transition Sparsity

Clement Nyanhongo, Bruno Miranda Henrique, Eugene Santos

arXiv:2504.11508v1 | h-index: 1 | Trans. Mach. Learn. Res.

Originality: Incremental advance

AI Analysis

This work addresses the computational cost and safety concerns of policy-based reward comparison in reinforcement learning, which matters to both researchers and practitioners, though it appears to be an incremental improvement over existing direct comparison methods.

The paper tackles the problem of comparing reward functions under transition sparsity, where existing direct comparison methods fail because they require high transition coverage. It introduces the Sparsity Resilient Reward Distance (SRRD) pseudometric, which removes that requirement, and demonstrates its practical efficacy across multiple domains.

Reward comparisons are vital for evaluating differences in agent behaviors induced by a set of reward functions. Most conventional techniques use the input reward functions to learn optimized policies, which are then used to compare agent behaviors. However, learning these policies can be computationally expensive and can also raise safety concerns. Direct reward comparison techniques obviate policy learning but suffer from transition sparsity, where only a small subset of transitions is sampled due to data collection challenges and feasibility constraints. Existing state-of-the-art direct reward comparison methods are ill-suited for these sparse conditions since they require high transition coverage, where the majority of transitions from a given coverage distribution are sampled. When this requirement is not satisfied, a distribution mismatch between sampled and expected transitions can occur, leading to significant errors. This paper introduces the Sparsity Resilient Reward Distance (SRRD) pseudometric, designed to eliminate the need for high transition coverage by accommodating diverse sample distributions, which are common under transition sparsity. We provide theoretical justification for SRRD's robustness and conduct experiments to demonstrate its practical efficacy across multiple domains.
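
To make the notion of transition coverage concrete, the following is a minimal sketch for a small tabular setting. It is not the paper's SRRD computation. It only measures what fraction of all possible (state, action, next state) triples appears in a sample, under the assumption of a uniform coverage distribution. The function names `transition_coverage` and `sample_transitions`, and the state and action counts, are hypothetical and chosen for illustration.

```python
import random


def transition_coverage(sampled, n_states, n_actions):
    """Fraction of all (s, a, s') triples that occur at least once in `sampled`.

    Assumes a tabular setting in which every triple is possible, so the
    denominator is n_states * n_actions * n_states (an illustrative assumption).
    """
    total = n_states * n_actions * n_states
    return len(set(sampled)) / total


def sample_transitions(n_samples, n_states, n_actions, seed=0):
    """Draw transitions uniformly at random (hypothetical coverage distribution)."""
    rng = random.Random(seed)
    return [
        (rng.randrange(n_states), rng.randrange(n_actions), rng.randrange(n_states))
        for _ in range(n_samples)
    ]


if __name__ == "__main__":
    n_states, n_actions = 10, 4  # 400 possible triples
    sparse = sample_transitions(20, n_states, n_actions)
    dense = sample_transitions(2000, n_states, n_actions)
    print("sparse coverage:", transition_coverage(sparse, n_states, n_actions))
    print("dense coverage:", transition_coverage(dense, n_states, n_actions))
```

With only 20 samples, at most 5% of the 400 triples can be observed, which is the sparse regime the abstract describes. A method that assumes most triples are present would then be working from a sample distribution that differs from the one it expects.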
