RO · Mar 6

Hierarchical Latent Action Model

arXiv:2603.05815v1 · 1 citations · h-index: 41
Originality: Incremental advance
AI Analysis

This work addresses the problem of discovering high-level, long-term skills from actionless video data, which is significant for applications in robotic control and interactive world models.

The paper introduces HiLAM, a hierarchical latent action model that discovers high-level latent skills from actionless videos by modeling long-term temporal information. Existing Latent Action Models (LAMs) typically focus on short-horizon frame transitions and low-level motion; HiLAM builds on a pretrained LAM and aggregates its latent action sequences into skills. The authors report that it improves over the baseline and exhibits robust dynamic skill discovery.

Latent Action Models (LAMs) enable learning from actionless data for applications ranging from robotic control to interactive world models. However, existing LAMs typically focus on short-horizon frame transitions and capture low-level motion while overlooking longer-term temporal structure. In contrast, actionless videos often contain temporally extended and high-level skills. We present HiLAM, a hierarchical latent action model that discovers latent skills by modeling long-term temporal information. To capture these dependencies across long horizons, we utilize a pretrained LAM as a low-level extractor. This architecture aggregates latent action sequences, which contain the underlying dynamic patterns of the video, into high-level latent skills. Our experiments demonstrate that HiLAM improves over the baseline and exhibits robust dynamic skill discovery.
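
The two-level structure described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names are hypothetical, the "pretrained LAM" is replaced by a simple frame-feature difference, and the aggregation into skills is plain mean pooling over fixed windows, chosen only because the abstract does not specify how the aggregation is done.

```python
from typing import List, Sequence

Vector = List[float]


def low_level_latent_actions(frames: Sequence[Vector]) -> List[Vector]:
    """Hypothetical stand-in for a pretrained, frozen LAM.

    Emits one latent action per consecutive frame pair; here it is just
    the element-wise feature difference (an assumption for illustration).
    """
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]


def aggregate_into_skills(actions: Sequence[Vector], window: int) -> List[Vector]:
    """Pool each non-overlapping window of latent actions into one skill vector.

    Mean pooling is an assumption; a learned sequence encoder would
    normally play this role. A trailing partial window is dropped.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    skills: List[Vector] = []
    for start in range(0, len(actions) - window + 1, window):
        chunk = actions[start:start + window]
        dim = len(chunk[0])
        skills.append([sum(a[d] for a in chunk) / window for d in range(dim)])
    return skills


# Toy "video": five frames of 2-D features, moving right and then up.
frames = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [2.0, 1.0], [2.0, 2.0]]
actions = low_level_latent_actions(frames)          # four low-level latent actions
skills = aggregate_into_skills(actions, window=2)   # two high-level skill vectors
print(skills)
```

In this toy setup the four short-horizon actions collapse into two longer-horizon vectors, one per movement phase, which is the kind of temporal abstraction the hierarchy is meant to provide.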

