AILGLOROFeb 8, 2021

Learning Optimal Strategies for Temporal Tasks in Stochastic Games

arXiv:2102.04307v36 citations
AI Analysis

This work addresses the problem of synthesizing assured controllers for systems operating in unknown, adversarial environments, which is crucial for safety-critical applications. It offers an incremental improvement by enabling model-free learning for LTL-specified controllers.

This paper tackles the problem of synthesizing controllers for systems operating in unknown, stochastic, and adversarial environments, guided by linear temporal logic (LTL) specifications. The authors developed a model-free reinforcement learning approach that learns optimal control strategies to maximize the probability of satisfying LTL specifications against worst-case environment behavior, outperforming existing methods in various case studies.

Synthesis from linear temporal logic (LTL) specifications provides assured controllers for systems operating in stochastic and potentially adversarial environments. Automatic synthesis tools, however, require a model of the environment to construct controllers. In this work, we introduce a model-free reinforcement learning (RL) approach to derive controllers from given LTL specifications even when the environment is completely unknown. We model the problem as a stochastic game (SG) between the controller and the adversarial environment; we then learn optimal control strategies that maximize the probability of satisfying the LTL specifications against the worst-case environment behavior. We first construct a product game using the deterministic parity automaton (DPA) translated from the given LTL specification. By deriving distinct rewards and discount factors from the acceptance condition of the DPA, we reduce the maximization of the worst-case probability of satisfying the LTL specification into the maximization of a discounted reward objective in the product game; this enables the use of model-free RL algorithms to learn an optimal controller strategy. To deal with the common scalability problems when the number of sets defining the acceptance condition of the DPA (usually referred as colors), is large, we propose a lazy color generation method where distinct rewards and discount factors are utilized only when needed, and an approximate method where the controller eventually focuses on only one color. In several case studies, we show that our approach is scalable to a wide range of LTL formulas, significantly outperforming existing methods for learning controllers from LTL specifications in SGs.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes