AI · Jun 25, 2017

Specifying Non-Markovian Rewards in MDPs Using LDL on Finite Traces (Preliminary Version)

arXiv:1706.08100v1 · 3 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses a domain-specific issue in reinforcement learning for tasks requiring long-term behavior specifications, representing an incremental improvement over prior methods using LTL variants.

The paper tackles the problem of specifying non-Markovian rewards in MDPs, which are difficult to express with standard state-dependent rewards. It specifies these rewards in LDLf on finite traces and provides an automata construction that offers minimality and compositionality guarantees.

In Markov Decision Processes (MDPs), the reward obtained in a state depends on the properties of the last state and action. This state dependency makes it difficult to reward more interesting long-term behaviors, such as always closing a door after it has been opened, or providing coffee only following a request. Extending MDPs to handle such non-Markovian reward functions was the subject of two previous lines of work, both using variants of LTL to specify the reward function and then compiling the new model back into a Markovian model. Building upon recent progress in the theories of temporal logics over finite traces, we adopt LDLf for specifying non-Markovian rewards and provide an elegant automata construction for building a Markovian model, which extends that of previous work and offers strong minimality and compositionality guarantees.
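To make the idea of compiling a non-Markovian reward back into a Markovian one more concrete, here is a minimal sketch for the "coffee only following a request" example. It is not the paper's LDLf-to-automaton construction: the three-state automaton is written by hand, and the names `dfa_step`, `extended_step`, `trace_reward`, `step`, the state labels, and the reward value are all assumptions made for illustration. The point it shows is that once the automaton state is carried along with the MDP state, the reward depends only on the current extended state and the latest step.

```python
from typing import FrozenSet, List, Tuple

Props = FrozenSet[str]  # propositions that hold at one step of the trace

# Hypothetical hand-written automaton states:
#   "idle"    : no request is outstanding
#   "pending" : a request was made and has not been served yet
#   "served"  : coffee was just delivered in response to a request
ACCEPTING = {"served"}  # entering this state means the reward condition holds
REWARD = 1.0            # illustrative reward value, not taken from the paper


def dfa_step(q: str, props: Props) -> str:
    """Advance the automaton by one step of the trace."""
    if q == "pending":
        return "served" if "coffee" in props else "pending"
    # q is "idle" or "served": nothing is outstanding, so only a request matters
    return "pending" if "request" in props else "idle"


def extended_step(q: str, props: Props) -> Tuple[str, float]:
    """One step of the extended model: next automaton state and its reward.

    The reward is Markovian with respect to the pair (MDP state, q),
    because q summarises the relevant part of the history.
    """
    q_next = dfa_step(q, props)
    return q_next, (REWARD if q_next in ACCEPTING else 0.0)


def trace_reward(trace: List[Props]) -> float:
    """Total reward collected along a finite trace, starting from "idle"."""
    q, total = "idle", 0.0
    for props in trace:
        q, r = extended_step(q, props)
        total += r
    return total


def step(*names: str) -> Props:
    """Small helper to build the proposition set for one step."""
    return frozenset(names)
```

For instance, `trace_reward([step("request"), step(), step("coffee")])` rewards the delivery, while `trace_reward([step("coffee")])` yields no reward because no request preceded the coffee.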

