Knowledge Transfer from Teachers to Learners in Growing-Batch Reinforcement Learning
This work addresses sample-complexity challenges for RL applications in safety-critical domains where continuous interaction is infeasible, though the contribution appears incremental because it builds on existing actor-critic methods.
The paper tackles sample inefficiency and limited data coverage in growing-batch reinforcement learning, where policies are updated from batches of data collected in real-world domains such as healthcare and robotics. It investigates how teacher information (demonstrations, expert actions, and gradient information) can mitigate these issues, and validates the approach on DeepMind Control Suite tasks.
Standard approaches to sequential decision-making exploit an agent's ability to continually interact with its environment and improve its control policy. However, due to safety, ethical, and practicality constraints, this type of trial-and-error experimentation is often infeasible in real-world domains such as healthcare and robotics. Instead, control policies in these domains are typically trained offline from previously logged data or in a growing-batch manner. In the growing-batch setting, a fixed policy is deployed to the environment and used to gather an entire batch of new data, which is then aggregated with past batches and used to update the policy. This improvement cycle can then be repeated multiple times. While a limited number of such cycles is feasible in real-world domains, the quality and diversity of the resulting data are much lower than in the standard continually-interacting approach. However, data collection in these domains is often performed in conjunction with human experts, who are able to label or annotate the collected data. In this paper, we first explore the trade-offs present in this growing-batch setting, and then investigate how information provided by a teacher (i.e., demonstrations, expert actions, and gradient information) can be leveraged at training time to mitigate the sample complexity and coverage requirements for actor-critic methods. We validate our contributions on tasks from the DeepMind Control Suite.
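The growing-batch cycle described above (deploy a fixed policy, collect a whole batch, aggregate it with past batches, update, repeat) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the one-dimensional toy environment, the linear policy with a single gain parameter, and the names `collect_batch`, `update_policy`, and `growing_batch` are all hypothetical, and the update rule is a simple regression on the better-rewarded half of the data rather than an actor-critic method.

```python
import random


def collect_batch(policy_gain, n_episodes, rng):
    """Deploy a FIXED policy and log (state, action, reward) tuples.

    Hypothetical toy environment: one-step episodes, reward is highest
    when the action equals -state.
    """
    batch = []
    for _ in range(n_episodes):
        state = rng.uniform(-1.0, 1.0)
        # The policy is frozen during collection; only noise varies.
        action = policy_gain * state + rng.gauss(0.0, 0.3)
        reward = -(action + state) ** 2
        batch.append((state, action, reward))
    return batch


def update_policy(dataset):
    """Toy offline update (an assumption, not the paper's method):
    least-squares fit of the gain on the better-rewarded half of the data."""
    keep = max(1, len(dataset) // 2)
    top = sorted(dataset, key=lambda t: t[2], reverse=True)[:keep]
    num = sum(s * a for s, a, _ in top)
    den = sum(s * s for s, _, _ in top)
    return num / den if den > 0 else 0.0


def growing_batch(n_cycles=5, episodes_per_cycle=200, seed=0):
    """Run a few deploy -> collect -> aggregate -> update cycles."""
    rng = random.Random(seed)
    policy_gain = 0.0   # initial policy
    dataset = []        # grows across cycles; old batches are kept
    sizes = []
    for _ in range(n_cycles):
        new_batch = collect_batch(policy_gain, episodes_per_cycle, rng)
        dataset.extend(new_batch)             # aggregate with past batches
        policy_gain = update_policy(dataset)  # update only between deployments
        sizes.append(len(dataset))
    return policy_gain, sizes


if __name__ == "__main__":
    gain, sizes = growing_batch()
    print("dataset size after each cycle:", sizes)
    print("final policy gain:", gain)
```

The point of the sketch is the control flow: the policy changes only between deployments, so each cycle contributes data from a single behaviour policy, which is one way to see why data diversity is lower than under continual interaction.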