LG · Mar 25, 2022

Preprocessing Reward Functions for Interpretability

Berkeley
arXiv:2203.13553v1 · 8 citations · h-index: 16
Originality: Incremental advance
AI Analysis

This work addresses the challenge of checking that learned reward functions align with user preferences in real-world applications. It is an incremental improvement over existing interpretability methods.

The paper addresses the validation of learned reward functions in reinforcement learning. It proposes a preprocessing step that simplifies a reward function into an equivalent one before interpretability tools are applied. In the authors' empirical evaluation, the preprocessed rewards are often significantly easier to understand than the originals.

In many real-world applications, the reward function is too complex to be manually specified. In such cases, reward functions must instead be learned from human feedback. Since the learned reward may fail to represent user preferences, it is important to be able to validate the learned reward function prior to deployment. One promising approach is to apply interpretability tools to the reward function to spot potential deviations from the user's intention. Existing work has applied general-purpose interpretability tools to understand learned reward functions. We propose exploiting the intrinsic structure of reward functions by first preprocessing them into simpler but equivalent reward functions, which are then visualized. We introduce a general framework for such reward preprocessing and propose concrete preprocessing algorithms. Our empirical evaluation shows that preprocessed rewards are often significantly easier to understand than the original reward.
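To make "simpler but equivalent" concrete, here is a minimal sketch under stated assumptions. The abstract does not say which equivalence or simplicity measure is used. The sketch assumes potential-based shaping, a standard transformation that leaves optimal policies unchanged, and uses a squared-norm cost because it has a closed-form least-squares solution. The tabular setting and all names (`shape`, `simplify_l2`, `optimal_policy`) are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def shape(reward, potential, gamma):
    """Potential shaping: R'(s, a, s') = R(s, a, s') + gamma * phi(s') - phi(s)."""
    return reward + gamma * potential[None, None, :] - potential[:, None, None]

def simplify_l2(reward, gamma):
    """Pick the potential that minimizes the squared norm of the shaped reward.

    Assumption: squared norm stands in for whatever simplicity cost is wanted.
    """
    n_states = reward.shape[0]
    s, _, s_next = np.indices(reward.shape)
    rows = np.arange(reward.size)
    # Linear map phi -> gamma * phi(s') - phi(s), one row per (s, a, s') entry.
    design = np.zeros((reward.size, n_states))
    np.add.at(design, (rows, s_next.ravel()), gamma)
    np.add.at(design, (rows, s.ravel()), -1.0)
    potential, *_ = np.linalg.lstsq(design, -reward.ravel(), rcond=None)
    return shape(reward, potential, gamma), potential

def optimal_policy(reward, transitions, gamma, iters=500):
    """Greedy policy from tabular value iteration; transitions[s, a, s'] = P(s' | s, a)."""
    values = np.zeros(reward.shape[0])
    for _ in range(iters):
        q = np.sum(transitions * (reward + gamma * values[None, None, :]), axis=2)
        values = q.max(axis=1)
    return q.argmax(axis=1)

# Toy setup: a sparse "reach the last state" reward hidden under random shaping.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
transitions = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
sparse = np.zeros((n_states, n_actions, n_states))
sparse[:, :, -1] = 1.0
noisy = shape(sparse, rng.normal(scale=5.0, size=n_states), gamma)
simplified, _ = simplify_l2(noisy, gamma)

print("norm before:", np.linalg.norm(noisy))
print("norm after: ", np.linalg.norm(simplified))
print("same optimal policy:",
      np.array_equal(optimal_policy(noisy, transitions, gamma),
                     optimal_policy(simplified, transitions, gamma)))
```

The simplified reward is the one a visualization tool would be pointed at. Because only the potential is optimized, the result stays in the same shaping-equivalence class as the input.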

Code Implementations: 1 repo