LG MLFeb 5, 2024

Sample Complexity Characterization for Linear Contextual MDPs

Junze Deng, Yuan Cheng, Shaofeng Zou, Yingbin Liang

arXiv:2402.02700v19.26 citationsh-index: 36AISTATS

Originality Highly original

AI Analysis

This work addresses the theoretical gap in CMDPs, which model time-varying environments in reinforcement learning, offering improved sample complexity bounds for researchers and practitioners in AI.

The authors tackled the problem of sample complexity for linear contextual Markov decision processes (CMDPs) by proposing model-based algorithms for two linear function approximation models, achieving guaranteed ε-suboptimality with polynomial sample complexity. Their results include removing a reachability assumption for tabular CMDPs and providing the first-known result for a specific model, showing that context-varying features improve sample efficiency over common representations.

Contextual Markov decision processes (CMDPs) describe a class of reinforcement learning problems in which the transition kernels and reward functions can change over time with different MDPs indexed by a context variable. While CMDPs serve as an important framework to model many real-world applications with time-varying environments, they are largely unexplored from theoretical perspective. In this paper, we study CMDPs under two linear function approximation models: Model I with context-varying representations and common linear weights for all contexts; and Model II with common representations for all contexts and context-varying linear weights. For both models, we propose novel model-based algorithms and show that they enjoy guaranteed $ε$-suboptimality gap with desired polynomial sample complexity. In particular, instantiating our result for the first model to the tabular CMDP improves the existing result by removing the reachability assumption. Our result for the second model is the first-known result for such a type of function approximation models. Comparison between our results for the two models further indicates that having context-varying features leads to much better sample efficiency than having common representations for all contexts under linear CMDPs.

View on arXiv PDF

Similar