Solving infinite-horizon POMDPs with memoryless stochastic policies in state-action space
The paper addresses reward optimization for POMDPs with memoryless stochastic policies, offering an incremental improvement over existing methods.
The paper tackles reward optimization in partially observable Markov decision processes (POMDPs) with memoryless stochastic policies by formulating it as a linear objective subject to polynomial constraints, and presents the ROSA (Reward Optimization in State-Action space) approach. Experiments on maze navigation tasks indicate that ROSA is computationally efficient and can yield stability improvements over existing methods.
Reward optimization in fully observable Markov decision processes is equivalent to a linear program over the polytope of state-action frequencies. Taking a similar perspective in the case of partially observable Markov decision processes with memoryless stochastic policies, the problem was recently formulated as the optimization of a linear objective subject to polynomial constraints. Based on this, we present an approach for Reward Optimization in State-Action space (ROSA). We test this approach experimentally in maze navigation tasks. We find that ROSA is computationally efficient and can yield stability improvements over existing methods.
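To make the state-action view concrete, the following is a minimal sketch and not the ROSA algorithm itself. It assumes the discounted-reward criterion and a hypothetical toy POMDP with two states, two observations, and two actions. All numbers and all names (`P`, `R`, `BETA`, `PI_OBS`, `state_policy`, `state_action_frequencies`, `discounted_reward`) are invented for illustration. The sketch checks numerically that the reward of a memoryless stochastic policy is a linear function of its state-action frequencies, and that those frequencies satisfy the linear flow constraints that define the polytope in the fully observable case.

```python
# Hypothetical toy POMDP (all numbers are illustrative assumptions).
GAMMA = 0.9                           # discount factor
MU = [0.5, 0.5]                       # initial state distribution
# P[s][a][s2]: probability of moving from state s to s2 under action a
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[1.0, 0.0], [0.0, 2.0]]          # R[s][a]: instantaneous reward
BETA = [[0.8, 0.2], [0.3, 0.7]]       # BETA[s][o]: probability of observing o in s
PI_OBS = [[0.6, 0.4], [0.1, 0.9]]     # PI_OBS[o][a]: memoryless stochastic policy


def state_policy(beta, pi_obs):
    """Effective state policy: pi(a|s) = sum_o beta(o|s) * pi(a|o)."""
    n_s, n_o, n_a = len(beta), len(pi_obs), len(pi_obs[0])
    return [[sum(beta[s][o] * pi_obs[o][a] for o in range(n_o))
             for a in range(n_a)] for s in range(n_s)]


def state_action_frequencies(pi, iters=2000):
    """Discounted frequencies eta(s,a) = (1-gamma) * sum_t gamma^t Pr(s_t=s, a_t=a)."""
    n_s, n_a = len(pi), len(pi[0])
    rho = MU[:]                       # discounted state distribution, by fixed-point iteration
    for _ in range(iters):
        rho = [(1 - GAMMA) * MU[s2]
               + GAMMA * sum(rho[s] * pi[s][a] * P[s][a][s2]
                             for s in range(n_s) for a in range(n_a))
               for s2 in range(n_s)]
    return [[rho[s] * pi[s][a] for a in range(n_a)] for s in range(n_s)]


def discounted_reward(pi, iters=2000):
    """Normalized discounted reward via iterative policy evaluation."""
    n_s, n_a = len(pi), len(pi[0])
    v = [0.0] * n_s
    for _ in range(iters):
        v = [sum(pi[s][a] * (R[s][a]
                             + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in range(n_s)))
                 for a in range(n_a))
             for s in range(n_s)]
    return (1 - GAMMA) * sum(MU[s] * v[s] for s in range(n_s))


pi = state_policy(BETA, PI_OBS)
eta = state_action_frequencies(pi)

# The objective is linear in eta: sum over (s, a) of eta(s, a) * r(s, a).
linear_objective = sum(eta[s][a] * R[s][a] for s in range(2) for a in range(2))
print(linear_objective, discounted_reward(pi))
```

The two printed values should agree up to numerical tolerance. In the fully observable case every `eta` that satisfies the flow constraints is achievable, which gives a linear program. Restricting to policies that depend only on the observation, as `PI_OBS` does here, limits which `eta` are achievable; the abstract describes that restriction as polynomial constraints.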