ROJun 2
ConTrack: Constrained Hand Motion Tracking with Adaptive Trade-off ControlYutong Liang, Quanquan Peng, Ri-Zhao Qiu et al.
Human demonstrations provide strong priors for robot manipulation, yet it is non-trivial to transfer them to execute on real robots due to the kinematic gap. In dexterous manipulation, it remains challenging to track long-horizon, contact-rich sequences even in simulators: a reference-tracking policy must keep objects on their target trajectories while preserving demonstrated joint motion and contact timing. Existing approaches often rely on hand-crafted reward tuning that require per-sequence tuning and break under limited interaction budgets. We introduce ConTrack, a reinforcement learning (RL) framework that scales with tracking data. ConTrack treats object tracking as a constraint and allocates remaining control authority to motion fidelity, which allows it to adapt task--style trade-offs online using a dual-variable update. In addition, ConTrack also stabilizes long-horizon learning with an adaptive mid-trajectory reset library that reuses policy-reachable simulator states. Our qualitative and quantitative results in simulation tracking and real robot demonstrate that ConTrack improves success and object pose accuracy significantly over prior arts while preserving joint and contact fidelity. Website: https://www.lyt0112.com/projects/ConTrack.
ROMar 10
Cross-Hand Latent Representation for Vision-Language-Action ModelsGuangqi Jiang, Yutong Liang, Jianglong Ye et al.
Dexterous manipulation is essential for real-world robot autonomy, mirroring the central role of human hand coordination in daily activity. Humans rely on rich multimodal perception--vision, sound, and language-guided intent--to perform dexterous actions, motivating vision-based, language-conditioned manipulation systems for robots. However, training reliable vision-language-action (VLA) models for dexterous manipulation requires large-scale demonstrations across many robotic hands. In addition, as new dexterous embodiments appear rapidly, collecting data for each becomes costly and impractical, creating a need for scalable cross-embodiment learning. We introduce XL-VLA, a vision-language-action framework integrated with a unified latent action space shared across diverse dexterous hands. This embodiment-invariant latent space is directly pluggable into standard VLA architectures, enabling seamless cross-embodiment training and efficient reuse of both existing and newly collected data. Experimental results demonstrate that XL-VLA consistently outperforms baseline VLA models operating in raw joint spaces, establishing it as an effective solution for scalable cross-embodiment dexterous manipulation.
GRJan 9
DexterCap: An Affordable and Automated System for Capturing Dexterous Hand-Object ManipulationYutong Liang, Shiyi Xu, Yulong Zhang et al.
Capturing fine-grained hand-object interactions is challenging due to severe self-occlusion from closely spaced fingers and the subtlety of in-hand manipulation motions. Existing optical motion capture systems rely on expensive camera setups and extensive manual post-processing, while low-cost vision-based methods often suffer from reduced accuracy and reliability under occlusion. To address these challenges, we present DexterCap, a low-cost optical capture system for dexterous in-hand manipulation. DexterCap uses dense, character-coded marker patches to achieve robust tracking under severe self-occlusion, together with an automated reconstruction pipeline that requires minimal manual effort. With DexterCap, we introduce DexterHand, a dataset of fine-grained hand-object interactions covering diverse manipulation behaviors and objects, from simple primitives to complex articulated objects such as a Rubik's Cube. We release the dataset and code to support future research on dexterous hand-object interaction. Project website: https://pku-mocca.github.io/Dextercap-Page/
ROOct 23, 2025
GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic ManipulationGuangqi Jiang, Haoran Chang, Ri-Zhao Qiu et al.
This paper presents GSWorld, a robust, photo-realistic simulator for robotics manipulation that combines 3D Gaussian Splatting with physics engines. Our framework advocates "closing the loop" of developing manipulation policies with reproducible evaluation of policies learned from real-robot data and sim2real policy training without using real robots. To enable photo-realistic rendering of diverse scenes, we propose a new asset format, which we term GSDF (Gaussian Scene Description File), that infuses Gaussian-on-Mesh representation with robot URDF and other objects. With a streamlined reconstruction pipeline, we curate a database of GSDF that contains 3 robot embodiments for single-arm and bimanual manipulation, as well as more than 40 objects. Combining GSDF with physics engines, we demonstrate several immediate interesting applications: (1) learning zero-shot sim2real pixel-to-action manipulation policy with photo-realistic rendering, (2) automated high-quality DAgger data collection for adapting policies to deployment environments, (3) reproducible benchmarking of real-robot manipulation policies in simulation, (4) simulation data collection by virtual teleoperation, and (5) zero-shot sim2real visual reinforcement learning. Website: https://3dgsworld.github.io/.
LGSep 24, 2025
Sensor optimization for urban wind estimation with cluster-based probabilistic frameworkYutong Liang, Chang Hou, Guy Y. Cornejo Maceda et al.
We propose a physics-informed machine-learned framework for sensor-based flow estimation for drone trajectories in complex urban terrain. The input is a rich set of flow simulations at many wind conditions. The outputs are velocity and uncertainty estimates for a target domain and subsequent sensor optimization for minimal uncertainty. The framework has three innovations compared to traditional flow estimators. First, the algorithm scales proportionally to the domain complexity, making it suitable for flows that are too complex for any monolithic reduced-order representation. Second, the framework extrapolates beyond the training data, e.g., smaller and larger wind velocities. Last, and perhaps most importantly, the sensor location is a free input, significantly extending the vast majority of the literature. The key enablers are (1) a Reynolds number-based scaling of the flow variables, (2) a physics-based domain decomposition, (3) a cluster-based flow representation for each subdomain, (4) an information entropy correlating the subdomains, and (5) a multi-variate probability function relating sensor input and targeted velocity estimates. This framework is demonstrated using drone flight paths through a three-building cluster as a simple example. We anticipate adaptations and applications for estimating complete cities and incorporating weather input.