Task Aware Dreamer for Task Generalization in Reinforcement Learning
This work addresses the challenge of training RL agents to generalize across tasks that differ in their reward functions, a capability that matters for real-world adaptability. The contribution appears incremental, since it extends existing world models with reward-informed features.
The paper tackles task generalization in reinforcement learning by introducing Task Aware Dreamer (TAD), a method that integrates reward-informed features into world models to improve adaptability across tasks with varying reward functions. The reported results show that TAD significantly improves performance on both seen and unseen tasks, particularly for task sets with high Task Distribution Relevance (TDR), in both image-based and state-based experiments.
A long-standing goal of reinforcement learning is to obtain agents that can learn on training tasks and generalize well to unseen tasks that share similar dynamics but have different reward functions. The ability to generalize across tasks is important because it determines an agent's adaptability to real-world scenarios where reward mechanisms might vary. In this work, we first show that training a general world model can exploit the structure shared by these tasks and help train more generalizable agents. Extending world models to the task generalization setting, we introduce a novel method named Task Aware Dreamer (TAD), which integrates reward-informed features to identify consistent latent characteristics across tasks. Within TAD, we derive the variational lower bound of the log-likelihood of the sampled data and use it as the optimization objective of our reward-informed world models; this bound introduces a new term designed to differentiate tasks using their states. To demonstrate the advantages of the reward-informed policy in TAD, we introduce a new metric called Task Distribution Relevance (TDR), which quantitatively measures the relevance of different tasks. For tasks exhibiting a high TDR, i.e., tasks that differ significantly, we show that Markovian policies struggle to distinguish them, so reward-informed policies such as those in TAD become necessary. Extensive experiments on both image-based and state-based tasks show that TAD significantly improves performance when handling different tasks simultaneously, especially for those with high TDR, and displays strong generalization to unseen tasks.
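The claim that Markovian policies struggle when tasks share dynamics but differ strongly in reward can be illustrated with a toy example. The sketch below is not TAD and does not compute TDR; it assumes a hypothetical single-state environment with two actions and two tasks whose rewards are exact opposites. All names (`reward`, `markov_expected_return`, `reward_informed_return`) are invented for illustration. A memoryless policy sees the same state in both tasks and so cannot tell them apart, while a policy that conditions on past rewards can identify the task after one step.

```python
from typing import Callable, List, Tuple

HORIZON = 10  # assumed episode length for this toy example


def reward(task: int, action: int) -> float:
    """Task 0 rewards action 0; task 1 rewards action 1.

    Both tasks share the same (trivial, single-state) dynamics.
    """
    return 1.0 if action == task else 0.0


def markov_expected_return(p_action0: float, task: int,
                           horizon: int = HORIZON) -> float:
    """Expected return of a memoryless policy.

    The state never changes, so a Markovian policy reduces to a fixed
    probability p_action0 of choosing action 0 at every step.
    """
    p_correct = p_action0 if task == 0 else 1.0 - p_action0
    return horizon * p_correct


def reward_informed_return(task: int, horizon: int = HORIZON) -> float:
    """Return of a simple history-dependent (win-stay, lose-shift) policy.

    It probes with action 0, then repeats an action that was rewarded
    and switches away from one that was not.
    """
    history: List[Tuple[int, float]] = []
    total = 0.0
    for _ in range(horizon):
        if not history:
            action = 0  # initial probe
        else:
            last_action, last_reward = history[-1]
            action = last_action if last_reward > 0 else 1 - last_action
        r = reward(task, action)
        history.append((action, r))
        total += r
    return total


def average_over_tasks(return_fn: Callable[[int], float]) -> float:
    """Average return over the two equally weighted tasks."""
    return sum(return_fn(task) for task in (0, 1)) / 2


# Every memoryless policy averages the same return across the two tasks.
markov_avg = {
    p: average_over_tasks(lambda t, p=p: markov_expected_return(p, t))
    for p in (0.0, 0.5, 1.0)
}
informed_avg = average_over_tasks(reward_informed_return)

print(markov_avg)    # {0.0: 5.0, 0.5: 5.0, 1.0: 5.0}
print(informed_avg)  # 9.5
```

In this toy setting, no choice of `p_action0` lifts the memoryless policy above half of the maximum return, whereas the history-dependent policy loses only the single probing step in one of the two tasks. This is the intuition behind conditioning the policy on reward information, although the paper's actual policies, world model, and TDR metric are considerably richer than this sketch.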