Falsification-Based Robust Adversarial Reinforcement Learning
This work addresses safety and generalization in reinforcement learning for autonomous vehicles by integrating falsification methods into adversarial training. The contribution is incremental, however, because it builds on existing robust adversarial RL approaches.
The paper tackles the problem of reinforcement learning policies overfitting to their training environments and failing to generalize to safety-critical test scenarios. It proposes a falsification-based robust adversarial RL framework that integrates temporal logic falsification to improve policy robustness without requiring an extra reward function for the adversary. In experiments on two autonomous vehicle systems, braking assistance and adaptive cruise control, policies trained with this approach generalize better and incur fewer safety specification violations in test scenarios.
Reinforcement learning (RL) has achieved enormous progress in solving various sequential decision-making problems, such as control tasks in robotics. Because policies overfit to their training environments, RL methods often fail to generalize to safety-critical test scenarios. Robust adversarial RL (RARL) was previously proposed to train an adversarial network that applies disturbances to a system, which improves robustness in test scenarios. However, a drawback of neural network-based adversaries is that integrating system requirements without handcrafting sophisticated reward signals is difficult. Safety falsification methods allow one to find a set of initial conditions and an input sequence such that the system violates a given property formulated in temporal logic. In this paper, we propose falsification-based RARL (FRARL), the first generic framework for integrating temporal logic falsification in adversarial learning to improve policy robustness. By applying our falsification method, we do not need to construct an extra reward function for the adversary. We evaluate our approach on a braking assistance system and an adaptive cruise control system of autonomous vehicles. Our experimental results demonstrate that policies trained with a falsification-based adversary generalize better and show fewer violations of the safety specification in test scenarios than those trained without an adversary or with an adversarial network.
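To make the falsification idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes a toy one-dimensional car-following model, the specification "always gap > 0" with robustness defined as the minimum gap over the trace, and plain random search as the falsifier. All names (`simulate`, `robustness`, `falsify`, `lazy_policy`) and all numeric ranges are hypothetical choices made for illustration.

```python
import random

DT = 0.1       # simulation time step in seconds (assumed)
HORIZON = 50   # number of steps per rollout (assumed)


def simulate(policy, init_gap, lead_accels, ego_speed=15.0, lead_speed=15.0):
    """Roll out a toy car-following model and return the gap at every step."""
    gap = init_gap
    trace = [gap]
    for a_lead in lead_accels:
        a_ego = policy(gap, ego_speed, lead_speed)
        # Speeds are clipped at zero: vehicles do not drive backwards.
        ego_speed = max(0.0, ego_speed + a_ego * DT)
        lead_speed = max(0.0, lead_speed + a_lead * DT)
        gap += (lead_speed - ego_speed) * DT
        trace.append(gap)
    return trace


def robustness(trace):
    """Robustness of 'always gap > 0': the minimum gap along the trace.

    A negative value means the specification is violated (a collision).
    """
    return min(trace)


def falsify(policy, n_samples=200, seed=0):
    """Random-search falsifier: sample initial conditions and input sequences
    and keep the candidate with the lowest robustness."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_samples):
        init_gap = rng.uniform(5.0, 30.0)
        lead_accels = [rng.uniform(-8.0, 2.0) for _ in range(HORIZON)]
        rho = robustness(simulate(policy, init_gap, lead_accels))
        if best is None or rho < best[0]:
            best = (rho, init_gap, lead_accels)
    return best


def lazy_policy(gap, ego_speed, lead_speed):
    """Hypothetical ego controller: brake hard only when the gap is small."""
    return -6.0 if gap < 5.0 else 0.0


rho, init_gap, lead_accels = falsify(lazy_policy)
print(f"lowest robustness found: {rho:.2f} (initial gap {init_gap:.2f} m)")
```

Note that the adversary here needs no reward function of its own: the search is driven directly by the robustness of the temporal logic specification. In an FRARL-style training loop, a scenario with low or negative robustness would presumably be fed back as a training case for the policy, and a practical falsifier would likely replace random search with a more sample-efficient optimizer.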