Operator Deep Q-Learning: Zero-Shot Reward Transferring in Reinforcement Learning
The paper addresses reward-function inflexibility in reinforcement learning, a problem relevant to both researchers and practitioners, though the contribution appears incremental because it builds on existing operator network concepts.
Standard reinforcement learning algorithms are tied to a single reward function. The paper proposes an operator view that maps reward functions to value functions, enabling zero-shot adaptation to unseen rewards. The proposed operator networks outperform existing methods and the standard general-purpose operator network design, and the operator deep Q-learning framework is demonstrated on reward transfer for offline policy evaluation and offline policy optimization.
Reinforcement learning (RL) has drawn increasing interest in recent years due to its tremendous success in various applications. However, standard RL algorithms can only be applied to a single reward function and cannot adapt quickly to an unseen reward function. In this paper, we advocate a general operator view of reinforcement learning, which enables us to directly approximate the operator that maps a reward function to its value function. The benefit of learning the operator is that we can take any new reward function as input and obtain its corresponding value function in a zero-shot manner. To approximate this special type of operator, we design a number of novel operator neural network architectures based on its theoretical properties. Our operator network designs outperform existing methods and the standard design of general-purpose operator networks, and we demonstrate the benefit of our operator deep Q-learning framework on reward transfer for offline policy evaluation (OPE) and offline policy optimization across a range of tasks.
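To make the "operator from reward function to value function" idea concrete, the following is a minimal sketch in the simplest setting where that operator has a closed form: a small tabular Markov chain under a fixed policy. There the value function satisfies v = r + gamma * P v, so the reward-to-value map is the matrix (I - gamma * P)^{-1}, which is linear in the reward. This is a standard textbook identity used only for illustration, not the paper's neural architecture; the random transition matrix, the state count, the discount factor, and the function name `policy_evaluation_operator` are all assumptions introduced here.

```python
import numpy as np


def policy_evaluation_operator(P_pi, gamma):
    """Return the matrix G with v = G @ r for any reward vector r.

    P_pi  : (n, n) row-stochastic state-transition matrix under a fixed policy
            (illustrative assumption; not taken from the paper).
    gamma : discount factor in [0, 1).
    """
    n = P_pi.shape[0]
    # v = r + gamma * P_pi @ v  =>  v = (I - gamma * P_pi)^{-1} r
    return np.linalg.inv(np.eye(n) - gamma * P_pi)


rng = np.random.default_rng(0)
n_states = 5
gamma = 0.9

# Hypothetical dynamics: a random row-stochastic transition matrix.
P_pi = rng.random((n_states, n_states))
P_pi /= P_pi.sum(axis=1, keepdims=True)

# Build the operator once; this step never looks at any reward.
G = policy_evaluation_operator(P_pi, gamma)

# "Zero-shot" evaluation: a previously unseen reward is just a new input.
r_new = rng.standard_normal(n_states)
v_new = G @ r_new

# The result satisfies the Bellman equation for the new reward.
assert np.allclose(v_new, r_new + gamma * P_pi @ v_new)

# The operator is linear in the reward.
r_a = rng.standard_normal(n_states)
r_b = rng.standard_normal(n_states)
assert np.allclose(G @ (2.0 * r_a + 3.0 * r_b), 2.0 * (G @ r_a) + 3.0 * (G @ r_b))
```

In this toy case the operator is computed exactly from known dynamics. The setting described in the abstract instead approximates the operator with a neural network, but the usage pattern is the same: fit the operator once, then feed in any new reward function without retraining.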