LG AI · Jun 11, 2025

A Unified Theory of Compositionality, Modularity, and Interpretability in Markov Decision Processes

arXiv:2506.09499v1 · 4.1 · h-index: 32

Originality: Highly original

AI Analysis

This work addresses the problem of scalable and interpretable planning for reinforcement learning agents in complex environments, presenting a foundational shift rather than an incremental improvement.

The paper tackles the challenge of achieving compositionality, modularity, and interpretability in Markov Decision Processes by introducing Option Kernel Bellman Equations (OKBEs) and state-time option kernels (STOKs), which enable efficient, verifiable long-horizon planning in high-dimensional settings without relying on reward maximization.

We introduce Option Kernel Bellman Equations (OKBEs) for a new reward-free Markov Decision Process. Rather than computing a value function, OKBEs directly construct and optimize a predictive map called a state-time option kernel (STOK) to maximize the probability of completing a goal while avoiding constraint violations. STOKs are compositional, modular, and interpretable initiation-to-termination transition kernels for policies in the Options Framework of Reinforcement Learning. This means: 1) STOKs can be composed using Chapman-Kolmogorov equations to make spatiotemporal predictions for multiple policies over long horizons, 2) high-dimensional STOKs can be represented and computed efficiently in a factorized and reconfigurable form, and 3) STOKs record the probabilities of semantically interpretable goal-success and constraint-violation events, which are needed for formal verification. Given a high-dimensional state-transition model for an intractable planning problem, we can decompose it with local STOKs and goal-conditioned policies that are aggregated into a factorized goal kernel, making it possible to forward-plan at the level of goals in high dimensions to solve the problem. These properties lead to highly flexible agents that can rapidly synthesize meta-policies, reuse planning representations across many tasks, and justify goals using empowerment, an intrinsic motivation function. We argue that reward maximization conflicts with the properties of compositionality, modularity, and interpretability. By contrast, OKBEs facilitate these properties, supporting verifiable long-horizon planning and intrinsic motivation that scales to dynamic high-dimensional world-models.
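To make the composition property in point 1 concrete, the following is a minimal sketch of Chapman-Kolmogorov composition on two small, made-up transition kernels. It is not the paper's STOK construction: the time dimension is omitted, and the state names, probabilities, and the `compose` helper are all hypothetical and chosen only for illustration. Each kernel maps an option's initiation states to a distribution over its termination events, and "violation" is treated as an absorbing event.

```python
def compose(k1, k2):
    """Chapman-Kolmogorov composition of two row-stochastic kernels.

    result[x][z] = sum over y of k1[x][y] * k2[y][z]
    """
    inner = len(k2)
    if any(len(row) != inner for row in k1):
        raise ValueError("columns of k1 must match rows of k2")
    cols = len(k2[0])
    return [
        [sum(row[y] * k2[y][z] for y in range(inner)) for z in range(cols)]
        for row in k1
    ]


# Hypothetical kernel for option A.
# Rows: initiation states (s0, s1). Columns: (reached door, violation).
kernel_a = [
    [0.9, 0.1],
    [0.7, 0.3],
]

# Hypothetical kernel for option B.
# Rows: (door, violation). Columns: (reached goal, violation).
# The second row keeps a violation absorbing: once violated, always violated.
kernel_b = [
    [0.8, 0.2],
    [0.0, 1.0],
]

# Kernel for "run A, then run B": rows (s0, s1), columns (goal, violation).
kernel_ab = compose(kernel_a, kernel_b)

for name, row in zip(("s0", "s1"), kernel_ab):
    print(f"{name}: P(goal)={row[0]:.2f}  P(violation)={row[1]:.2f}")
```

Because each row of the composed kernel still sums to one and its columns remain named events, the probabilities of goal success and constraint violation can be read off directly, which is the kind of interpretability point 3 refers to.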
