ROMay 28Code
A Heterogeneous Architecture for Robot RL Beyond GPU-Dominant ParadigmsYufei Jia, Zhanxiang Cao, Mingrui Yu et al.
Simulation-based RL for contemporary robot control is increasingly organized around GPU-resident simulation: physics, rollout collection, and learning are placed on a single GPU-centric execution path. This paradigm has greatly improved training speed, but it has also encouraged a default assumption that efficient training requires physics to reside on the GPU. We revisit this assumption. Our view is that, in simulation-dominated robot control, the essential question is not which processor runs physics, but whether simulation throughput, policy learning, and runtime synchronization form an efficient end-to-end loop. We present UniLab, a heterogeneous CPU-simulation / GPU-learning architecture that decouples CPU-parallel simulation from GPU policy updates through a unified runtime for data movement, buffering, and synchronization. UniLab is implemented as a complete and extensible training system using MuJoCoUni and MotrixSim CPU-batched physics backends, supporting PPO, SAC, FlashSAC, TD3, and APPO. On representative simulation-based robot control tasks, UniLab improves end-to-end training efficiency by 3--10$\times$ under the same hardware configuration, while reducing dependence on the NVIDIA CUDA-based software stack and supporting cross-platform execution on the Apple macOS platform and the AMD ROCm and Intel XPU accelerator backends. These results show that GPU simulation is an effective path to efficient training, but not a necessary one, broadening the practical system choices available for robot RL training. Project page: https://github.com/unilabsim/UniLab.
CVJan 29
Causal World Modeling for Robot ControlLin Li, Qihang Zhang, Yiming Luo et al.
This work highlights that video world modeling, alongside vision-language pre-training, establishes a fresh and independent foundation for robot learning. Intuitively, video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. Inspired by this, we introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. Our model features three carefully crafted designs: (1) a shared latent space, integrating vision and action tokens, driven by a Mixture-of-Transformers (MoT) architecture, (2) a closed-loop rollout mechanism, allowing for ongoing acquisition of environmental feedback with ground-truth observations, (3) an asynchronous inference pipeline, parallelizing action prediction and motor execution to support efficient control. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations. The code and model are made publicly available to facilitate the community.
ROApr 28
GS-Playground: A High-Throughput Photorealistic Simulator for Vision-Informed Robot LearningYufei Jia, Heng Zhang, Ziheng Zhang et al.
Embodied AI research is undergoing a shift toward vision-centric perceptual paradigms. While massively parallel simulators have catalyzed breakthroughs in proprioception-based locomotion, their potential remains largely untapped for vision-informed tasks due to the prohibitive computational overhead of large-scale photorealistic rendering. Furthermore, the creation of simulation-ready 3D assets heavily relies on labor-intensive manual modeling, while the significant sim-to-real physical gap hinders the transfer of contact-rich manipulation policies. To address these bottlenecks, we propose GS-Playground, a multi-modal simulation framework designed to accelerate end-to-end perceptual learning. We develop a novel high-performance parallel physics engine, specifically designed to integrate with a batch 3D Gaussian Splatting (3DGS) rendering pipeline to ensure high-fidelity synchronization. Our system achieves a breakthrough throughput of 10^4 FPS at 640x480 resolution, significantly lowering the barrier for large-scale visual RL. Additionally, we introduce an automated Real2Sim workflow that reconstructs photorealistic, physically consistent, and memory-efficient environments, streamlining the generation of complex simulation-ready scenes. Extensive experiments on locomotion, navigation, and manipulation demonstrate that GS-Playground effectively bridges the perceptual and physical gaps across diverse embodied tasks. Project homepage: https://gsplayground.github.io.
CVOct 31, 2025
HiGS: Hierarchical Generative Scene Framework for Multi-Step Associative Semantic Spatial CompositionJiacheng Hong, Kunzhen Wu, Mingrui Yu et al.
Three-dimensional scene generation holds significant potential in gaming, film, and virtual reality. However, most existing methods adopt a single-step generation process, making it difficult to balance scene complexity with minimal user input. Inspired by the human cognitive process in scene modeling, which progresses from global to local, focuses on key elements, and completes the scene through semantic association, we propose HiGS, a hierarchical generative framework for multi-step associative semantic spatial composition. HiGS enables users to iteratively expand scenes by selecting key semantic objects, offering fine-grained control over regions of interest while the model completes peripheral areas automatically. To support structured and coherent generation, we introduce the Progressive Hierarchical Spatial-Semantic Graph (PHiSSG), which dynamically organizes spatial relationships and semantic dependencies across the evolving scene structure. PHiSSG ensures spatial and geometric consistency throughout the generation process by maintaining a one-to-one mapping between graph nodes and generated objects and supporting recursive layout optimization. Experiments demonstrate that HiGS outperforms single-stage methods in layout plausibility, style consistency, and user preference, offering a controllable and extensible paradigm for efficient 3D scene construction.
ROSep 23, 2021
Shape Control of Deformable Linear Objects with Offline and Online Learning of Local Linear Deformation ModelsMingrui Yu, Hanzhong Zhong, Xiang Li
The shape control of deformable linear objects (DLOs) is challenging, since it is difficult to obtain the deformation models. Previous studies often approximate the models in purely offline or online ways. In this paper, we propose a scheme for the shape control of DLOs, where the unknown model is estimated with both offline and online learning. The model is formulated in a local linear format, and approximated by a neural network (NN). First, the NN is trained offline to provide a good initial estimation of the model, which can directly migrate to the online phase. Then, an adaptive controller is proposed to achieve the shape control tasks, in which the NN is further updated online to compensate for any errors in the offline model caused by insufficient training or changes of DLO properties. The simulation and real-world experiments show that the proposed method can precisely and efficiently accomplish the DLO shape control tasks, and adapt well to new and untrained DLOs.
ROJul 1, 2021
Adaptive Control for Robotic Manipulation of Deformable Linear Objects with Offline and Online Learning of Unknown ModelsMingrui Yu, Hanzhong Zhong, Fangxun Zhong et al.
The deformable linear objects (DLOs) are common in both industrial and domestic applications, such as wires, cables, ropes. Because of its highly deformable nature, it is difficult for the robot to reproduce human's dexterous skills on DLOs. In this paper, the unknown deformation model is estimated in both the offline and online manners. The offline learning aims to provide a good approximation prior to the manipulation task, while the online learning aims to compensate the errors due to insufficient training (e.g. limited datasets) in the offline phase. The offline module works by constructing a series of supervised neural networks (NNs), then the online module receives the learning results directly and further updates them with the technique of adaptive NNs. A new adaptive controller is also proposed to allow the robot to perform manipulation tasks concurrently in the online phase. The stability of the closed-loop system and the convergence of task errors are rigorously proved with Lyapunov method. Simulation studies are presented to illustrate the performance of the proposed method.