Hongrui Zhu

AI
h-index: 10
Papers: 3
Citations: 3
Novelty: 55%
AI Score: 39

3 Papers

RO | May 10 | Code
Wavelet Policy: Imitation Learning in the Scale Domain with World Prior Memory

Changchuan Yang, Yuhang Dong, Guanzhong Tian et al.

Conventional visuomotor imitation learning usually predicts future robot actions directly in the time domain. Such formulations often have limited awareness of the physical scene and weak long-horizon memory. World-model-based perception and memory-augmented policies can improve world awareness, but at substantial computational cost. In this work, we propose Wavelet Policy, a lightweight imitation learning framework that combines World Prior Memory (WPM) with wavelet-based multi-scale action modeling. Our key idea is to encode persistent physical scene structure from static background images into compact memory tokens, which are fused into world-prior tokens and injected into the encoder during forward propagation. Based on this memory-conditioned representation, we further perform wavelet-domain decomposition over horizon-aligned latent action tokens and adopt a Single-Encoder Multiple-Decoder (SE2MD) architecture to model latent components at different temporal scales. The resulting latent subbands are reconstructed through an inverse wavelet transform and finally projected into executable action chunks. To facilitate efficient world-prior learning, we introduce a world-prior adaptation loss that encourages the background encoder to retain persistent scene knowledge while remaining lightweight and stable. Extensive experiments on four simulated and six real-world robotic manipulation tasks show that Wavelet Policy consistently outperforms strong baselines. These results demonstrate that combining scale-domain action modeling with world-prior memory provides an effective and efficient solution for long-horizon embodied manipulation. We release the source code, data, and model checkpoints for the simulation tasks at https://github.com/lurenjia384/Wavelet_Policy.
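
The decomposition and reconstruction steps described in the abstract can be illustrated with a generic wavelet transform along the horizon axis. The following is a minimal sketch, not the authors' implementation: it assumes a Haar wavelet, a horizon length divisible by 2 at every level, and tokens stored as plain lists of floats. The function names (`haar_dwt`, `haar_idwt`, `wavelet_decompose`, `wavelet_reconstruct`) are hypothetical, and in the paper the subbands would be processed by learned decoders before reconstruction, which is omitted here.

```python
import math


def haar_dwt(tokens):
    """Single-level Haar transform along the horizon (first) axis.

    tokens: list of H token vectors (H even), each a list of floats.
    Returns (approx, detail), each holding H // 2 token vectors.
    """
    if len(tokens) % 2 != 0:
        raise ValueError("horizon length must be even")
    s = math.sqrt(2.0)
    approx, detail = [], []
    for even, odd in zip(tokens[0::2], tokens[1::2]):
        approx.append([(a + b) / s for a, b in zip(even, odd)])  # coarse trend
        detail.append([(a - b) / s for a, b in zip(even, odd)])  # fine variation
    return approx, detail


def haar_idwt(approx, detail):
    """Invert haar_dwt: interleave the recovered even and odd tokens."""
    s = math.sqrt(2.0)
    tokens = []
    for a_row, d_row in zip(approx, detail):
        tokens.append([(a + d) / s for a, d in zip(a_row, d_row)])
        tokens.append([(a - d) / s for a, d in zip(a_row, d_row)])
    return tokens


def wavelet_decompose(tokens, levels):
    """Multi-level decomposition: [approx_L, detail_L, ..., detail_1]."""
    details = []
    approx = tokens
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        details.append(detail)
    return [approx] + details[::-1]


def wavelet_reconstruct(subbands):
    """Rebuild the full-horizon tokens from the coarsest subband upward."""
    approx = subbands[0]
    for detail in subbands[1:]:
        approx = haar_idwt(approx, detail)
    return approx


if __name__ == "__main__":
    # Eight horizon-aligned tokens with two channels each (toy data).
    tokens = [[float(t), float(t * t)] for t in range(8)]
    subbands = wavelet_decompose(tokens, levels=2)
    print([len(band) for band in subbands])
    restored = wavelet_reconstruct(subbands)
    print(max(abs(r - t) for rr, tt in zip(restored, tokens) for r, t in zip(rr, tt)))
```

Each subband covers a different temporal scale: the approximation band carries the slow trend of the action sequence, while the detail bands carry progressively finer corrections. Because the Haar transform is orthonormal, reconstruction from unmodified subbands recovers the input up to floating-point error.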

AI | Feb 11, 2025
ImitDiff: Transferring Foundation-Model Priors for Distraction Robust Visuomotor Policy

Yuhang Dong, Haizhou Ge, Yupei Zeng et al.

Visuomotor imitation learning policies enable robots to efficiently acquire manipulation skills from visual demonstrations. However, as scene complexity and visual distractions increase, policies that perform well in simple settings often experience substantial performance degradation. To address this challenge, we propose ImitDiff, a diffusion-based imitation learning policy guided by fine-grained semantics within a dual-resolution workflow. Leveraging pretrained priors of vision-language foundation models, our method transforms high-level instructions into pixel-level visual semantic masks. These masks guide a dual-resolution perception pipeline that captures both global context (e.g., overall layout) from low-resolution observation and fine-grained local features (e.g., geometric details) from high-resolution observation, enabling the policy to focus on task-relevant regions. Additionally, we introduce a consistency-driven diffusion transformer action head that bridges visual semantic conditions and real-time action generation. Extensive experiments demonstrate that ImitDiff outperforms state-of-the-art vision-language manipulation frameworks, as well as visuomotor imitation learning policies, particularly under increased scene complexity and visual distractions. Notably, ImitDiff exhibits strong generalization in zero-shot settings involving novel objects and visual distractions. Furthermore, our consistency-driven action head achieves an order-of-magnitude improvement in inference speed while maintaining competitive success rates.

LG | Nov 18, 2024
Bridging the Resource Gap: Deploying Advanced Imitation Learning Models onto Affordable Embedded Platforms

Haizhou Ge, Ruixiang Wang, Zhu-ang Xu et al.

Advanced imitation learning with architectures such as the transformer is increasingly demonstrating its advantages in robotics. However, deploying these large-scale models on embedded platforms remains a major challenge. In this paper, we propose a pipeline that facilitates the migration of advanced imitation learning algorithms to edge devices. The pipeline combines an efficient model compression method with a practical asynchronous parallel method, Temporal Ensemble with Dropped Actions (TEDA), that enhances the smoothness of operations. To show the efficiency of the proposed pipeline, large-scale imitation learning models are trained on a server and deployed on an edge device to complete various manipulation tasks.
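
The abstract does not spell out how TEDA works, so the following is only a generic sketch of temporal ensembling over action chunks in which predictions for already-elapsed control steps are discarded, which is one plausible reading of the name. Everything here is an assumption for illustration: the class `TemporalEnsemble`, its `decay` parameter, the exponential weighting that favors the oldest prediction, and the rule for dropping stale actions are not taken from the paper.

```python
import math
from collections import defaultdict


class TemporalEnsemble:
    """Blend overlapping action-chunk predictions for each control step.

    Hypothetical sketch: chunks may arrive late from an asynchronous
    inference worker, so actions aimed at past steps are dropped.
    """

    def __init__(self, decay=0.1):
        self.decay = decay
        # target step -> list of (observation step, predicted action)
        self.buffer = defaultdict(list)

    def add_chunk(self, obs_step, chunk, now):
        """Store a chunk; chunk[i] is the action predicted for obs_step + i."""
        for i, action in enumerate(chunk):
            step = obs_step + i
            if step < now:
                continue  # stale prediction for a step already executed: drop it
            self.buffer[step].append((obs_step, action))

    def act(self, now):
        """Return the weighted average action for step `now`, or None."""
        # Forget anything scheduled for steps that have already passed.
        for step in [s for s in self.buffer if s < now]:
            del self.buffer[step]
        preds = self.buffer.pop(now, [])
        if not preds:
            return None
        preds.sort(key=lambda p: p[0])  # oldest observation first
        # Assumed weighting: exp(-decay * i), so older predictions weigh more.
        weights = [math.exp(-self.decay * i) for i in range(len(preds))]
        total = sum(weights)
        dim = len(preds[0][1])
        return [
            sum(w * action[d] for w, (_, action) in zip(weights, preds)) / total
            for d in range(dim)
        ]


if __name__ == "__main__":
    ens = TemporalEnsemble(decay=0.0)  # decay 0 gives a plain average
    ens.add_chunk(obs_step=0, chunk=[[0.0], [1.0], [2.0]], now=0)
    print(ens.act(0), ens.act(1))
    # A chunk computed from the step-1 observation arrives late, at step 2.
    ens.add_chunk(obs_step=1, chunk=[[3.0], [5.0]], now=2)
    print(ens.act(2))
```

Averaging overlapping predictions smooths the executed trajectory, and discarding stale entries keeps a slow inference worker on an edge device from injecting actions meant for the past.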