LGFLFeb 6, 2025

Learning Reward Machines from Partially Observed Policies

arXiv:2502.03762v33 citationsh-index: 3Trans. Mach. Learn. Res.
Originality Incremental advance
AI Analysis

This work addresses the challenge of inverse reinforcement learning for reward machines, which is incremental as it builds on existing methods but extends to partially observed policies.

The authors tackled the problem of inferring reward machines from partially observed optimal policies, proposing a SAT-based algorithm that recovers the exact reward machine up to an equivalence class given sufficient finite information, with demonstrations on discrete, continuous, and real-world examples.

Inverse reinforcement learning is the problem of inferring a reward function from an optimal policy or demonstrations by an expert. In this work, it is assumed that the reward is expressed as a reward machine whose transitions depend on atomic propositions associated with the state of a Markov Decision Process (MDP). Our goal is to identify the true reward machine using finite information. To this end, we first introduce the notion of a prefix tree policy which associates a distribution of actions to each state of the MDP and each attainable finite sequence of atomic propositions. Then, we characterize an equivalence class of reward machines that can be identified given the prefix tree policy. Finally, we propose a SAT-based algorithm that uses information extracted from the prefix tree policy to solve for a reward machine. It is proved that if the prefix tree policy is known up to a sufficient (but finite) depth, our algorithm recovers the exact reward machine up to the equivalence class. This sufficient depth is derived as a function of the number of MDP states and (an upper bound on) the number of states of the reward machine. These results are further extended to the case where we only have access to demonstrations from an optimal policy. Several examples, including discrete grid and block worlds, a continuous state-space robotic arm, and real data from experiments with mice, are used to demonstrate the effectiveness and generality of the approach.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes