Mapping representations in Reinforcement Learning via Semantic Alignment for Zero-Shot Stitching
This work addresses the limited reusability and robustness of reinforcement learning policies in dynamically changing environments. The contribution is incremental, since it builds on existing semantic alignment techniques.
The paper tackles the failure of deep reinforcement learning models to generalize to small changes in environment observations or tasks, a failure that typically forces costly retraining. It proposes a zero-shot method that uses semantic alignment to map between the latent spaces of different agents, enabling modular policy reuse without fine-tuning. The authors report empirically that performance remains high under visual and task domain shifts, using the CarRacing environment with changing backgrounds and tasks.
Deep Reinforcement Learning (RL) models often fail to generalize when even small changes occur in the environment's observations or task requirements. Addressing these shifts typically requires costly retraining, limiting the reusability of learned policies. In this paper, we build on recent work in semantic alignment to propose a zero-shot method for mapping between the latent spaces of different agents trained on different visual and task variations. Specifically, we learn a transformation that maps embeddings from one agent's encoder into the latent space of another agent's encoder without further fine-tuning. Our approach relies on a small set of semantically aligned "anchor" observations, which we use to estimate an affine or orthogonal transform. Once the transformation is found, a controller trained in one domain can interpret embeddings from a different, already trained encoder in a zero-shot fashion, without additional training. We empirically demonstrate that our framework preserves high performance under visual and task domain shifts, evaluating zero-shot stitching on the CarRacing environment with changing backgrounds and tasks. By allowing modular re-assembly of existing policies, our approach paves the way for more robust, compositional RL in dynamically changing environments.
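The transform estimation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`fit_orthogonal_map`, `fit_affine_map`), the embedding dimension, the number of anchors, and the synthetic "encoders" are all assumptions made for illustration. It assumes paired anchor embeddings, one row per anchor observation, and omits any centering or scaling of the embeddings. The orthogonal case uses the standard orthogonal Procrustes solution and the affine case uses ordinary least squares.

```python
import numpy as np


def fit_orthogonal_map(src, dst):
    """Orthogonal Procrustes: R minimizing ||src @ R - dst||_F with R.T @ R = I."""
    u, _, vt = np.linalg.svd(src.T @ dst)
    return u @ vt


def fit_affine_map(src, dst):
    """Least-squares affine map (W, b) such that dst ~= src @ W + b."""
    # Append a column of ones so the bias is estimated jointly with W.
    src_h = np.hstack([src, np.ones((src.shape[0], 1))])
    sol, *_ = np.linalg.lstsq(src_h, dst, rcond=None)
    return sol[:-1], sol[-1]


# Synthetic stand-ins for two encoders (hypothetical sizes): the latent space
# of agent B is an exact rotation of the latent space of agent A.
rng = np.random.default_rng(0)
dim, n_anchors = 8, 64
anchors_a = rng.normal(size=(n_anchors, dim))         # anchors encoded by A
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
anchors_b = anchors_a @ true_rotation                 # same anchors encoded by B

R = fit_orthogonal_map(anchors_a, anchors_b)
W, b = fit_affine_map(anchors_a, anchors_b)

# Zero-shot use: embeddings of unseen observations from encoder A are mapped
# into B's latent space, where a controller trained with B could consume them.
new_a = rng.normal(size=(5, dim))
mapped_orth = new_a @ R
mapped_aff = new_a @ W + b
```

In this idealized setting both estimators recover the rotation up to numerical error. With real encoders the relation between latent spaces is only approximate, so which of the two transforms works better is an empirical question.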