MDPs with a State Sensing Cost
This addresses a practical issue in resource-constrained environments like robotics or IoT, but it is incremental as it builds on existing MDP frameworks with a specific cost extension.
The paper tackles the problem of sequential decision-making where sensing the state incurs a cost, by formulating it as a Markov Decision Process (MDP) with an expanded state space. They propose a computationally efficient algorithm, SPI, which performs close to optimal in practice, as demonstrated in a numerical case study.
In many practical sequential decision-making problems, tracking the state of the environment incurs a sensing/communication/computation cost. In these settings, the agent's interaction with its environment includes the additional component of deciding when to sense the state, in a manner that balances the value associated with optimal (state-specific) actions and the cost of sensing. We formulate this as an expected discounted cost Markov Decision Process (MDP), wherein the agent incurs an additional cost for sensing its next state, but has the option to take actions while remaining `blind' to the system state. We pose this problem as a classical discounted cost MDP with an expanded (countably infinite) state space. While computing the optimal policy for this MDP is intractable in general, we derive lower bounds on the optimal value function, which allow us to bound the suboptimality gap of any policy. We also propose a computationally efficient algorithm SPI, based on policy improvement, which in practice performs close to the optimal policy. Finally, we benchmark against the state-of-the-art via a numerical case study.