LG · Jun 22, 2022

Multi-Horizon Representations with Hierarchical Forward Models for Reinforcement Learning

Trevor McInroe, Lukas Schäfer, Stefano V. Albrecht

Microsoft

arXiv:2206.11396v29.68 citations · h-index: 25 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses sample inefficiency in pixel-based RL for robotic control and offers an incremental improvement over existing representation learning methods.

The paper tackles the challenge of learning control from pixels in reinforcement learning by proposing HKSL, an auxiliary task that learns multi-horizon representations through hierarchical forward models and step-skipping critics. It shows that HKSL converges faster to higher or optimal returns in robotic control tasks, with improved sample efficiency and representations that capture task-relevant details accurately across timescales.

Learning control from pixels is difficult for reinforcement learning (RL) agents because representation learning and policy learning are intertwined. Previous approaches remedy this issue with auxiliary representation learning tasks, but they either do not consider the temporal aspect of the problem or only consider single-step transitions, which may cause learning inefficiencies if important environmental changes take many steps to manifest. We propose Hierarchical $k$-Step Latent (HKSL), an auxiliary task that learns multiple representations via a hierarchy of forward models that learn to communicate and an ensemble of $n$-step critics that all operate at varying magnitudes of step skipping. We evaluate HKSL in a suite of 30 robotic control tasks with and without distractors and a task of our creation. We find that HKSL converges to higher or optimal episodic returns more quickly than several alternative representation learning approaches. Furthermore, we find that HKSL's representations capture task-relevant details accurately across timescales (even in the presence of distractors) and that communication channels between hierarchy levels organize information based on both sides of the communication process, both of which improve sample efficiency.
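To make the idea of critics operating at different magnitudes of step skipping more concrete, the following is a minimal sketch of the standard $n$-step bootstrapped target computed for several horizons at once. It is not the HKSL implementation: the function names `n_step_target` and `multi_horizon_targets`, the indexing convention (`rewards[i]` is the reward for the transition from state `i` to state `i + 1`, and `values[i]` is a critic's estimate for state `i`), and the toy numbers are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def n_step_target(rewards: Sequence[float], bootstrap_value: float, gamma: float) -> float:
    """Standard n-step return with n = len(rewards):
    sum_{i<n} gamma^i * r_i + gamma^n * V(s_{t+n})."""
    target = 0.0
    for i, r in enumerate(rewards):
        target += (gamma ** i) * r
    return target + (gamma ** len(rewards)) * bootstrap_value


def multi_horizon_targets(
    rewards: Sequence[float],
    values: Sequence[float],
    t: int,
    horizons: Sequence[int],
    gamma: float,
) -> Dict[int, float]:
    """Hypothetical helper: one target per horizon n, each bootstrapping
    from values[t + n]. Requires len(values) > t + max(horizons)."""
    return {
        n: n_step_target(rewards[t:t + n], values[t + n], gamma)
        for n in horizons
    }


# Toy trajectory (made-up numbers, chosen so the arithmetic is exact).
rewards = [1.0, 1.0, 1.0, 1.0]
values = [0.0, 4.0, 8.0, 0.0, 16.0]
targets = multi_horizon_targets(rewards, values, t=0, horizons=[1, 2, 4], gamma=0.5)
```

Each horizon sees the same trajectory but trades off observed rewards against the bootstrapped estimate differently, which is one way to read "varying magnitudes of step skipping". How HKSL pairs such critics with its forward-model hierarchy is described in the paper itself.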
