Global-Local Attention Decomposition for Terrain Encoding in Humanoid Perceptive Locomotion
This method disentangles the two perceptual roles of terrain encoding in humanoid locomotion, broad terrain awareness and precise foothold selection, enabling robust performance on challenging terrain where previous perceptive policies struggle.
GLAD introduces a global-local attention decomposition for terrain encoding that separates broad context awareness from precise foothold selection, enabling humanoid robots to reliably traverse sparse-foothold terrain and obstacle-rich environments, with zero-shot sim-to-real transfer demonstrated on a Unitree G1 robot.
Although reinforcement learning has significantly advanced humanoid locomotion, perceptive policies still struggle on sparse-foothold terrain and in constrained environments. Success in these scenarios requires both broad terrain awareness and precise foothold selection, two perceptual roles that conventional encoders often entangle. To address this challenge, we propose Global-Local Attention Decomposition (GLAD) for terrain encoding in humanoid locomotion. Realized as a coarse-to-fine encoder over a robot-centric elevation map, GLAD explicitly separates these roles: a global attention branch uses attention pooling to summarize the surrounding terrain context, while a state-conditioned local attention branch sparsifies and encodes precise foothold-relevant geometry. This explicit decomposition prevents the dilution of fine-grained spatial cues while reducing training overhead. Experiments demonstrate that GLAD enables reliable locomotion over challenging gaps, stepping stones, and stairs. Furthermore, the learned policy exhibits emergent terrain-responsive behaviors, autonomously following narrow paths and avoiding obstacles under simple velocity commands, without an explicit navigation planner. In real-world deployment on a Unitree G1 humanoid robot using onboard LiDAR, the proposed method achieves robust zero-shot sim-to-real transfer across diverse sparse-foothold and obstacle-rich domains.
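To make the global-local split concrete, the following is a minimal NumPy sketch of one way such an encoder could be organized. It is not the authors' implementation. The abstract does not specify the tokenization, the sparsification rule, or any layer sizes, so everything below is an assumption for illustration: cells are embedded from (x, y, height), the coarse level is obtained by patch averaging, the local branch sparsifies by keeping the top-k cells scored against a query derived from the robot state, and all names (`make_tokens`, `global_branch`, `local_branch`, `encode_terrain`, `params`) are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def make_tokens(height_map, w_embed):
    """Embed each map cell from (x, y, height) into a D-dim token (assumed scheme)."""
    n_rows, n_cols = height_map.shape
    rows, cols = np.meshgrid(np.linspace(-1, 1, n_rows),
                             np.linspace(-1, 1, n_cols), indexing="ij")
    feats = np.stack([cols.ravel(), rows.ravel(), height_map.ravel()], axis=-1)
    return feats @ w_embed  # (n_rows * n_cols, D)

def global_branch(tokens, query):
    """Attention pooling: a single query vector summarizes all coarse tokens."""
    weights = softmax(tokens @ query / np.sqrt(tokens.shape[-1]))
    return weights @ tokens  # (D,)

def local_branch(tokens, state, w_q, k):
    """State-conditioned attention restricted to the k highest-scoring cells.

    Top-k selection is an assumed stand-in for the sparsification step.
    """
    query = state @ w_q  # query depends on the robot state
    scores = tokens @ query / np.sqrt(tokens.shape[-1])
    kept = np.argsort(scores)[-k:]  # indices of the k best-matching cells
    weights = softmax(scores[kept])
    return weights @ tokens[kept], kept

def encode_terrain(height_map, state, params, patch=2, k=8):
    """Coarse map feeds the global branch; the full-resolution map feeds the local one."""
    n_rows, n_cols = height_map.shape
    coarse = height_map.reshape(n_rows // patch, patch,
                                n_cols // patch, patch).mean(axis=(1, 3))
    global_feat = global_branch(make_tokens(coarse, params["w_embed"]),
                                params["query"])
    local_feat, kept = local_branch(make_tokens(height_map, params["w_embed"]),
                                    state, params["w_q"], k)
    return np.concatenate([global_feat, local_feat]), kept

# Toy usage with random (untrained) weights and a random 8x8 elevation map.
rng = np.random.default_rng(0)
D, S = 16, 6  # token width and state size, both arbitrary here
params = {
    "w_embed": rng.normal(size=(3, D)),
    "query": rng.normal(size=D),
    "w_q": rng.normal(size=(S, D)),
}
height_map = rng.uniform(0.0, 0.3, size=(8, 8))
state = rng.normal(size=S)
latent, kept = encode_terrain(height_map, state, params)
print(latent.shape, kept.shape)  # (32,) (8,)
```

In this sketch the two branches never share attention weights: the global summary is computed over a small number of coarse tokens, while the local feature attends only to a few full-resolution cells, so fine geometry is not averaged together with the wider context. In a trained policy the random matrices would be learned parameters and the concatenated latent would be passed to the locomotion policy alongside proprioception.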