Generative Intrinsic Optimization: Intrinsic Control with Model Learning
This work addresses a foundational gap in reinforcement learning: it offers a theoretical route to improving sample efficiency and to incorporating environmental uncertainty into decision-making. The contribution is incremental, since it builds on existing concepts of intrinsic control and model learning.
The paper tackles the problem of integrating intrinsic motivation with reward maximization in reinforcement learning by proposing a policy iteration scheme that incorporates mutual information, ensuring convergence to the optimal policy. It introduces a variational approach to jointly learn necessary quantities for estimating mutual information and dynamics models, providing a general framework for different outcome forms.
A future sequence represents the outcome of executing an action in the environment (i.e., the trajectory onwards). When driven by the information-theoretic concept of mutual information, the agent seeks maximally informative consequences. Explicit outcomes may take the form of a state, a return, or a trajectory, serving different purposes such as credit assignment or imitation learning. However, the question of how intrinsic motivation should be incorporated with reward maximization is often neglected. In this work, we propose a policy iteration scheme that seamlessly incorporates the mutual information while ensuring convergence to the optimal policy. Concurrently, we introduce a variational approach that jointly learns the quantity needed to estimate the mutual information and the dynamics model, providing a general framework for incorporating different forms of outcomes of interest. While we mainly focus on theoretical analysis, our approach opens up the possibility of leveraging intrinsic control with model learning to enhance sample efficiency and to incorporate uncertainty about the environment into decision-making.
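To make the role of a variational estimate of mutual information concrete, the sketch below works through a toy discrete case. It is an illustration under stated assumptions, not the paper's objective: the abstract does not give the exact bound used, so the example relies on the standard Barber-Agakov lower bound, E[log q(a | o) - log p(a)] <= I(A; O). The two-action policy, the outcome model, and all function names are hypothetical.

```python
import math

def mutual_information(policy, outcome_model):
    """Exact I(A; O) in nats for a discrete policy p(a) and outcome model p(o | a)."""
    n_actions, n_outcomes = len(policy), len(outcome_model[0])
    # Marginal over outcomes: p(o) = sum_a p(a) p(o | a)
    marginal = [sum(policy[a] * outcome_model[a][o] for a in range(n_actions))
                for o in range(n_outcomes)]
    mi = 0.0
    for a in range(n_actions):
        for o in range(n_outcomes):
            joint = policy[a] * outcome_model[a][o]
            if joint > 0.0:
                mi += joint * math.log(outcome_model[a][o] / marginal[o])
    return mi

def variational_lower_bound(policy, outcome_model, q):
    """Barber-Agakov bound E_{p(a,o)}[log q(a | o) - log p(a)], with q indexed as q[o][a]."""
    bound = 0.0
    for a in range(len(policy)):
        for o in range(len(outcome_model[0])):
            joint = policy[a] * outcome_model[a][o]
            if joint > 0.0:
                bound += joint * (math.log(q[o][a]) - math.log(policy[a]))
    return bound

def true_posterior(policy, outcome_model):
    """Bayes posterior p(a | o), indexed as posterior[o][a]."""
    n_actions, n_outcomes = len(policy), len(outcome_model[0])
    posterior = []
    for o in range(n_outcomes):
        joint = [policy[a] * outcome_model[a][o] for a in range(n_actions)]
        total = sum(joint)
        posterior.append([j / total for j in joint])
    return posterior

# Hypothetical toy problem: two actions, two possible outcomes.
policy = [0.5, 0.5]                       # p(a)
outcome_model = [[0.9, 0.1], [0.2, 0.8]]  # p(o | a), rows indexed by action

exact_mi = mutual_information(policy, outcome_model)
# The bound is tight when q equals the true posterior p(a | o).
tight = variational_lower_bound(policy, outcome_model,
                                true_posterior(policy, outcome_model))
# An uninformative q (uniform over actions) gives a looser bound.
loose = variational_lower_bound(policy, outcome_model, [[0.5, 0.5], [0.5, 0.5]])
```

In this toy setting the bound matches the exact mutual information when `q` is the true posterior and falls below it otherwise, which is why learning a good `q` (here assumed to stand in for the "necessary quantity" mentioned above) matters for the estimate.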