LG · RO · Dec 6, 2021

Distilled Domain Randomization

arXiv:2112.03149v1 · 4 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of expensive real-world data collection for robot control by enabling direct transfer from simulation to reality. It appears incremental, however, as it builds on existing domain randomization and distillation techniques.

The paper tackles the reality gap in robot control by combining reinforcement learning from randomized physics simulations with policy distillation. The resulting method, called DiDoR, achieves performance comparable to or better than the baselines in sim-to-sim and sim-to-real experiments without requiring target domain data.

Deep reinforcement learning is an effective tool to learn robot control policies from scratch. However, these methods are notorious for the enormous amount of required training data, which is prohibitively expensive to collect on real robots. A highly popular alternative is to learn from simulations, which allows the data to be generated much faster, more safely, and more cheaply. Since all simulators are mere models of reality, there are inevitable differences between the simulated and the real data, often referred to as the 'reality gap'. To bridge this gap, many approaches learn one policy from a distribution over simulators. In this paper, we propose to combine reinforcement learning from randomized physics simulations with policy distillation. Our algorithm, called Distilled Domain Randomization (DiDoR), distills so-called teacher policies, which are experts on domains that have been sampled initially, into a student policy that is later deployed. This way, DiDoR learns controllers which transfer directly from simulation to reality, i.e., without requiring data from the target domain. We compare DiDoR against three baselines in three sim-to-sim as well as two sim-to-real experiments. Our results show that the target domain performance of policies trained with DiDoR is on par with or better than that of the baselines. Moreover, our approach increases neither the required memory capacity nor the time to compute an action, either of which may well be a point of failure when deploying the learned controller.
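The teacher-student scheme described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation, and several simplifications are assumptions made only for illustration: the randomized simulator is a one-dimensional point mass whose only randomized parameter is its mass, each teacher is a hand-tuned PD gain rather than a policy trained with reinforcement learning, and distillation is done by ordinary least squares on the teachers' state-action pairs. All names (sample_domains, teacher_gain, rollout, distill) are hypothetical.

```python
import numpy as np

def sample_domains(rng, n_domains):
    """Domain randomization: draw one physics parameter (a mass) per domain."""
    return rng.uniform(0.5, 2.0, size=n_domains)

def teacher_gain(mass, kp=4.0, kd=3.0):
    """Stand-in for an RL-trained expert (assumption): a PD gain scaled to one mass."""
    return mass * np.array([kp, kd])

def rollout(gain, mass, rng, steps=50, dt=0.05):
    """Simulate a point mass under u = -gain @ state; return visited states and actions."""
    state = rng.uniform(-1.0, 1.0, size=2)  # [position, velocity]
    states, actions = [], []
    for _ in range(steps):
        u = -gain @ state
        states.append(state.copy())
        actions.append(u)
        pos, vel = state
        # Explicit Euler step of the point-mass dynamics.
        state = np.array([pos + dt * vel, vel + dt * u / mass])
    return np.array(states), np.array(actions)

def distill(masses, rng, rollouts_per_teacher=5):
    """Fit one linear student to the actions of all teachers via least squares."""
    all_states, all_actions = [], []
    for mass in masses:
        gain = teacher_gain(mass)
        for _ in range(rollouts_per_teacher):
            s, a = rollout(gain, mass, rng)
            all_states.append(s)
            all_actions.append(a)
    S = np.vstack(all_states)
    A = np.concatenate(all_actions)
    # The student acts as u = -student_gain @ state, so regress -A on S.
    student_gain, *_ = np.linalg.lstsq(S, -A, rcond=None)
    return student_gain

rng = np.random.default_rng(0)
masses = sample_domains(rng, n_domains=8)   # domains are sampled once, up front
student_gain = distill(masses, rng)         # a single policy to deploy
```

In this sketch the student has the same parameter count as any single teacher, so deploying it costs no more memory or compute per action than deploying one teacher, which mirrors the property the abstract highlights.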

