LGJul 25, 2023

Submodular Reinforcement Learning

Manish Prajapat, Mojmír Mutný, Melanie N. Zeilinger, Andreas Krause

arXiv:2307.13372v218.827 citationsh-index: 43Has Code

Originality Incremental advance

AI Analysis

This work addresses a fundamental limitation in RL for applications like coverage control and experiment design, offering a novel paradigm that is incremental in extending RL to handle submodular rewards.

The paper tackles the problem of reinforcement learning with non-additive, history-dependent rewards that exhibit diminishing returns, proposing Submodular RL (SubRL) and a policy gradient algorithm called SubPO, which achieves optimal constant factor approximations under certain assumptions and demonstrates sample efficiency and scalability in applications like biodiversity monitoring and Bayesian experiment design.

In reinforcement learning (RL), rewards of states are typically considered additive, and following the Markov assumption, they are $\textit{independent}$ of states visited previously. In many important applications, such as coverage control, experiment design and informative path planning, rewards naturally have diminishing returns, i.e., their value decreases in light of similar states visited previously. To tackle this, we propose $\textit{submodular RL}$ (SubRL), a paradigm which seeks to optimize more general, non-additive (and history-dependent) rewards modelled via submodular set functions which capture diminishing returns. Unfortunately, in general, even in tabular settings, we show that the resulting optimization problem is hard to approximate. On the other hand, motivated by the success of greedy algorithms in classical submodular optimization, we propose SubPO, a simple policy gradient-based algorithm for SubRL that handles non-additive rewards by greedily maximizing marginal gains. Indeed, under some assumptions on the underlying Markov Decision Process (MDP), SubPO recovers optimal constant factor approximations of submodular bandits. Moreover, we derive a natural policy gradient approach for locally optimizing SubRL instances even in large state- and action- spaces. We showcase the versatility of our approach by applying SubPO to several applications, such as biodiversity monitoring, Bayesian experiment design, informative path planning, and coverage maximization. Our results demonstrate sample efficiency, as well as scalability to high-dimensional state-action spaces.

View on arXiv PDF Code

Similar