LG | AI | Dec 8, 2025

Model-Based Reinforcement Learning Under Confounding

arXiv:2512.07528v1 | 1 citation | h-index: 8
Originality: Incremental advance
AI Analysis

The paper addresses a fundamental inconsistency in model-based RL under confounding, enabling principled learning and planning when contextual information is missing. The advance is incremental because it builds on existing frameworks such as MaxCausalEnt.

The paper tackles model-based reinforcement learning in confounded environments where unobserved context biases offline data, by adapting a proximal off-policy evaluation method to identify reward expectations and integrating it with a behavior-averaged transition model to create a consistent surrogate MDP.

We investigate model-based reinforcement learning in contextual Markov decision processes (C-MDPs) in which the context is unobserved and induces confounding in the offline dataset. In such settings, conventional model-learning methods are fundamentally inconsistent, as the transition and reward mechanisms generated under a behavioral policy do not correspond to the interventional quantities required for evaluating a state-based policy. To address this issue, we adapt a proximal off-policy evaluation approach that identifies the confounded reward expectation using only observable state-action-reward trajectories under mild invertibility conditions on proxy variables. When combined with a behavior-averaged transition model, this construction yields a surrogate MDP whose Bellman operator is well defined and consistent for state-based policies, and which integrates seamlessly with the maximum causal entropy (MaxCausalEnt) model-learning framework. The proposed formulation enables principled model learning and planning in confounded environments where contextual information is unobserved, unavailable, or impractical to collect.
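The inconsistency described in the abstract can be illustrated with a toy example. The sketch below is not the paper's proximal method; it is a minimal single-state illustration, with all numbers and names (`p_u`, `pi_b`, `reward`) chosen hypothetically, of why the reward expectation estimated from behavior data can differ from the interventional expectation a state-based policy needs. The interventional value is computed here using the true context distribution, which is exactly the information assumed unavailable in the paper's setting.

```python
import numpy as np

# Hypothetical toy setup: one state, binary context U, binary action A.
p_u = np.array([0.5, 0.5])             # P(U = u)
pi_b = np.array([[0.9, 0.1],           # behavior policy pi_b(a | u = 0)
                 [0.1, 0.9]])          # behavior policy pi_b(a | u = 1)
reward = np.array([[1.0, 0.0],         # r(u = 0, a)
                   [0.0, 1.0]])        # r(u = 1, a)


def observational_reward(p_u, pi_b, reward):
    """E[r | A = a] as seen in offline data logged under pi_b."""
    joint = p_u[:, None] * pi_b                        # P(u, a)
    p_u_given_a = joint / joint.sum(axis=0, keepdims=True)  # P(u | a)
    return (p_u_given_a * reward).sum(axis=0)


def interventional_reward(p_u, reward):
    """E[r | do(A = a)]: the context keeps its marginal distribution."""
    return p_u @ reward


obs = observational_reward(p_u, pi_b, reward)
intv = interventional_reward(p_u, reward)
print("observational  E[r | a]     :", obs)
print("interventional E[r | do(a)] :", intv)
```

Because the behavior policy picks actions that match the hidden context, the logged data makes every action look better than it is for a policy that cannot see the context. A model fitted naively to such data inherits this bias, which is the gap the proximal identification step in the abstract is meant to close.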
