Robust Temporal Difference Learning for Critical Domains
This work addresses the need for robust reinforcement learning in critical domains, where rare events can have severe consequences, by offering a novel, incremental method.
The authors tackle the problem of learning robust policies in critical domains with significant rare events by introducing a new Q-function operator, the $κ$-operator, which enables robust temporal difference learning without observing these events. Empirical results show superior performance in both the early learning phase and the converged stage, robustness to small model errors, and applicability in multi-agent settings.
We present a new Q-function operator for temporal difference (TD) learning methods that explicitly encodes robustness against significant rare events (SREs) in critical domains. The operator, which we call the $κ$-operator, makes it possible to learn a robust policy in a model-based fashion without actually observing the SREs. We introduce single- and multi-agent robust TD methods using the $κ$-operator. Using the theory of Generalized Markov Decision Processes, we prove that the operator converges to the optimal robust Q-function with respect to the model. In addition, we prove convergence to the optimal Q-function of the original MDP as the probability of SREs vanishes. Empirical evaluations demonstrate the superior performance of $κ$-based TD methods both in the early learning phase and in the final converged stage. We also show that the proposed method is robust to small model errors and applicable in a multi-agent context.
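The abstract does not state the operator's definition, so the following is only an illustrative sketch of how a robustness parameter could enter a tabular TD target. It assumes, hypothetically, that $κ$ acts as a modelled SRE probability that blends the best-case next-state value (weight $1-κ$) with the worst-case one (weight $κ$). The function names, the blending rule, and the table layout are assumptions made for illustration, not the authors' formulation.

```python
def kappa_target(q_next, reward, gamma, kappa):
    """Hypothetical robust TD target.

    q_next: list of Q-values for every action in the next state.
    kappa:  assumed probability of a significant rare event, in [0, 1].
    """
    if not 0.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [0, 1]")
    best = max(q_next)   # value if the agent keeps control
    worst = min(q_next)  # value if the rare event forces the worst action
    return reward + gamma * ((1.0 - kappa) * best + kappa * worst)


def kappa_q_update(q, state, action, reward, next_state, alpha, gamma, kappa):
    """One tabular TD step towards the hypothetical robust target.

    q is a dict mapping each state to a list of per-action Q-values.
    """
    target = kappa_target(q[next_state], reward, gamma, kappa)
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]


# Toy usage: two states, two actions each.
q_table = {"s0": [0.0, 0.0], "s1": [1.0, 3.0]}
kappa_q_update(q_table, "s0", 0, reward=0.5, next_state="s1",
               alpha=0.1, gamma=0.9, kappa=0.5)
```

Under this assumption, setting `kappa` to 0 recovers the ordinary Q-learning target, which is in the spirit of the stated result that the method converges to the optimal Q-function of the original MDP as the SRE probability vanishes. Note that no SRE has to be sampled: the worst case enters through the target alone.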