RO May 14
XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations
Shichao Fan, Kun Wu, Zhengping Che et al.
Recent progress in large-scale robotic datasets and vision-language models (VLMs) has advanced research on vision-language-action (VLA) models. However, existing VLA models still face two fundamental challenges: (i) producing precise low-level actions from high-dimensional observations, and (ii) bridging domain gaps across heterogeneous data sources, including diverse robot embodiments and human demonstrations. Existing methods often encode latent variables from either visual dynamics or robotic actions to guide policy learning, but they fail to fully exploit the complementary multi-modal knowledge present in large-scale, heterogeneous datasets. In this work, we present X Robotic Model 1 (XR-1), a novel framework for versatile and scalable VLA learning across diverse robots, tasks, and environments. XR-1 introduces the Unified Vision-Motion Codes (UVMC), a discrete latent representation learned via a dual-branch VQ-VAE that jointly encodes visual dynamics and robotic motion. UVMC addresses these challenges by (i) serving as an intermediate representation between the observations and actions, and (ii) aligning multimodal dynamic information from heterogeneous data sources to capture complementary knowledge. To effectively exploit UVMC, we propose a three-stage training paradigm: (i) self-supervised UVMC learning, (ii) UVMC-guided pretraining on large-scale cross-embodiment robotic datasets, and (iii) task-specific post-training. We validate XR-1 through extensive real-world experiments with more than 14,000 rollouts on six different robot embodiments, spanning over 120 diverse manipulation tasks. XR-1 consistently outperforms state-of-the-art baselines such as $π_{0.5}$, $π_0$, RDT, UniVLA, and GR00T-N1.5 while demonstrating strong generalization to novel objects, background variations, distractors, and illumination changes. Our project is at https://xr-1-vla.github.io/.
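As a rough illustration of what a discrete latent code means in a VQ-VAE, the sketch below shows only the generic vector-quantization step: a continuous latent is replaced by the index of its nearest codebook entry. The codebook values, the two latent vectors, and the names `quantize`, `CODEBOOK`, `vision_latent`, and `motion_latent` are hypothetical; the abstract does not describe XR-1's dual-branch encoder or training losses, and they are not reproduced here.

```python
import math


def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry nearest to `vector` (Euclidean)."""
    best_index = min(
        range(len(codebook)),
        key=lambda i: math.dist(vector, codebook[i]),
    )
    return best_index, codebook[best_index]


# Hypothetical 2-D codebook with four entries (illustrative values only).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical continuous latents, standing in for the outputs of a
# visual-dynamics branch and a motion branch.
vision_latent = (0.9, 0.2)
motion_latent = (0.1, 0.8)

vision_code, _ = quantize(vision_latent, CODEBOOK)
motion_code, _ = quantize(motion_latent, CODEBOOK)
print(vision_code, motion_code)  # prints "1 2"
```

In a trained VQ-VAE the codebook is learned jointly with the encoder and decoder; here it is fixed so that the lookup itself is easy to follow.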
RO Apr 9
HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation
Shuanghao Bai, Meng Li, Xinyuan Lv et al.
Humans achieve complex manipulation through coordinated whole-body control, whereas most Vision-Language-Action (VLA) models treat robot body parts largely independently, making high-DoF humanoid control challenging and often unstable. We present HEX, a state-centric framework for coordinated manipulation on full-sized bipedal humanoid robots. HEX introduces a humanoid-aligned universal state representation for scalable learning across heterogeneous embodiments, and incorporates a Mixture-of-Experts Unified Proprioceptive Predictor to model whole-body coordination and temporal motion dynamics from large-scale multi-embodiment trajectory data. To efficiently capture temporal visual context, HEX uses lightweight history tokens to summarize past observations, avoiding repeated encoding of historical images during inference. It further employs a residual-gated fusion mechanism with a flow-matching action head to adaptively integrate visual-language cues with proprioceptive dynamics for action generation. Experiments on real-world humanoid manipulation tasks show that HEX achieves state-of-the-art performance in task success rate and generalization, particularly in fast-reaction and long-horizon scenarios.
SY Dec 28, 2017
Aircraft trajectory control with feedback linearization for general nonlinear system
Sheng Zhang, Fei Liao, Yanqing Chen et al.
The feedback linearization method is further developed for controller design on general nonlinear systems. Through Lyapunov stability theory, the intractable nonlinear implicit algebraic control equations are effectively solved, and asymptotic tracking performance is guaranteed. Moreover, it is proved that the controller may be used in an inverse-free form for set-point control. With this method, a nonlinear aircraft outer-loop trajectory controller is developed. To address concerns about the controller's robustness, integral control is incorporated to counteract the adverse effect of modeling errors. Simulation results verify the good performance of the proposed controller.
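For readers unfamiliar with feedback linearization, here is a minimal textbook sketch on a scalar control-affine system x' = f(x) + g(x)u. The plant (`f`, `g`), the gain, and the set-point are hypothetical choices for illustration. The abstract's contribution concerns general nonlinear systems with implicit control equations, which this simple explicit inversion does not cover.

```python
import math


def f(x):
    """Hypothetical drift term of the plant."""
    return -x ** 3


def g(x):
    """Hypothetical input gain; always >= 1, so it can be inverted safely."""
    return 2.0 + math.cos(x)


def control(x, target, gain=2.0):
    """Cancel the nonlinearity and impose the linear error dynamics e' = -gain * e."""
    v = -gain * (x - target)      # desired linear behavior
    return (v - f(x)) / g(x)      # explicit inversion of the affine input map


def simulate(x0, target, dt=0.01, steps=1000):
    """Forward-Euler simulation of the closed loop."""
    x = x0
    for _ in range(steps):
        u = control(x, target)
        x += dt * (f(x) + g(x) * u)
    return x


final_state = simulate(x0=3.0, target=1.0)
print(final_state)
```

With the nonlinearity cancelled, the tracking error decays exponentially, so the state settles very close to the set-point of 1.0 within the simulated horizon.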
RO Feb 18
RoboGene: Boosting VLA Pre-training via Diversity-Driven Agentic Framework for Real-World Task Generation
Yixue Zhang, Kun Wu, Zhi Gao et al.
The pursuit of general-purpose robotic manipulation is hindered by the scarcity of diverse, real-world interaction data. Unlike data collection from the web in vision or language, robotic data collection is an active process incurring prohibitive physical costs. Consequently, automated task curation to maximize data value remains a critical yet under-explored challenge. Existing manual methods are unscalable and biased toward common tasks, while off-the-shelf foundation models often hallucinate physically infeasible instructions. To address this, we introduce RoboGene, an agentic framework designed to automate the generation of diverse, physically plausible manipulation tasks across single-arm, dual-arm, and mobile robots. RoboGene integrates three core components: diversity-driven sampling for broad task coverage, self-reflection mechanisms to enforce physical constraints, and human-in-the-loop refinement for continuous improvement. We conduct extensive quantitative analysis and large-scale real-world experiments, collecting datasets of 18k trajectories and introducing novel metrics to assess task quality, feasibility, and diversity. Results demonstrate that RoboGene significantly outperforms state-of-the-art foundation models (e.g., GPT-4o, Gemini 2.5 Pro). Furthermore, real-world experiments show that VLA models pre-trained with RoboGene achieve higher success rates and superior generalization, underscoring the importance of high-quality task generation. Our project is available at https://robogene-boost-vla.github.io.
SY Jan 2, 2018
Computation of Optimal Control Problems with Terminal Constraint via Variation Evolution
Sheng Zhang, Bo Liao, Fei Liao
Inspired by the inverse consideration of the stable continuous-time dynamics evolution, the Variation Evolving Method (VEM) treats the optimal solution as the equilibrium point of an infinite-dimensional dynamic system and solves for it in an asymptotically evolving way. In this paper, the compact version of the VEM is further developed for the computation of Optimal Control Problems (OCPs) with terminal constraint. The corresponding Evolution Partial Differential Equation (EPDE), which describes the variation motion towards the optimal solution, is derived, and the costate-free optimality conditions are established. The explicit analytic expressions of the costates and the Lagrange multipliers adjoining the terminal constraint, related to the states and the control variables, are presented. With the semi-discrete method from the field of PDE numerical calculation, the EPDE is discretized into finite-dimensional Initial-value Problems (IVPs), which are solved with common Ordinary Differential Equation (ODE) numerical integration methods.
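The idea of evolving a candidate solution toward an equilibrium, and of turning the evolution PDE into an ODE initial-value problem by discretizing one variable, can be sketched on a toy problem. The example below is a plain gradient flow and is not the paper's EPDE. It minimizes J[u] = ∫ ½ (du/dt)² dt on [0, 1] with u(0) = 0 and u(1) = 1, whose evolution equation in the variation time τ is ∂u/∂τ = ∂²u/∂t². The grid size, step size, and function name `evolve` are assumptions made for illustration.

```python
def evolve(n_points=11, d_tau=0.004, n_steps=500):
    """Semi-discrete (method-of-lines) solution of du/dtau = d2u/dt2.

    The time axis t is discretized with central differences, leaving an ODE
    system in the variation time tau, integrated here with forward Euler.
    """
    h = 1.0 / (n_points - 1)
    grid = [i * h for i in range(n_points)]

    # Arbitrary initial guess that satisfies the boundary values u(0)=0, u(1)=1.
    u = [0.0] * n_points
    u[-1] = 1.0

    for _ in range(n_steps):
        rhs = [0.0] * n_points  # endpoints stay fixed (zero rate of change)
        for i in range(1, n_points - 1):
            rhs[i] = (u[i - 1] - 2.0 * u[i] + u[i + 1]) / h ** 2
        u = [ui + d_tau * ri for ui, ri in zip(u, rhs)]
    return grid, u


grid, u = evolve()
max_error = max(abs(ui - ti) for ui, ti in zip(u, grid))
print(max_error)
```

The equilibrium of this flow is the straight line u(t) = t, which is also the minimizer of J, so `max_error` measures how far the evolved solution is from the optimum. The step `d_tau` is kept below h²/2 so that forward Euler stays stable.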
SY Jan 29, 2018
Computation of Optimal Control Problems with Terminal Constraint via Modified Evolution Partial Differential Equation
Sheng Zhang, Kai-Feng He, Fei Liao
The Variation Evolving Method (VEM), which seeks the optimal solutions with the variation evolution principle, is further developed to be more flexible in solving Optimal Control Problems (OCPs) with terminal constraint. Using first-order stable dynamics to eliminate infeasibilities, the Modified Evolution Partial Differential Equation (MEPDE), which is valid in the infeasible solution domain, is proposed, and a Lyapunov functional is constructed to theoretically ensure its validity. In particular, it is proved that even with the infinite-time convergence dynamics, the violated terminal inequality constraints, which are inactive for the optimal solution, will enter the feasible domain in finite time. By transforming the MEPDE into a finite-dimensional Initial-value Problem (IVP) with the semi-discrete method, the OCPs may be solved with common Ordinary Differential Equation (ODE) numerical integration methods. Illustrative examples are presented to show the effectiveness of the proposed method.
SY Jan 26, 2025
The Third Evolution Equation for Optimal Control Computation
Sheng Zhang, Fei Liao, Kai-Feng He
The Variation Evolving Method (VEM), which originates from continuous-time dynamics stability theory, seeks the optimal solutions with the variation evolution principle. After establishing the first and the second evolution equations within its framework, the third evolution equation is developed. This equation only solves the control variables along the variation time to get the optimal solution, and its definite conditions may be arbitrary since the equation can eliminate possible infeasibilities. With this equation, the dimension of the resulting Initial-value Problem (IVP), transformed via the semi-discrete method, is greatly reduced. Therefore, it may relieve the computational burden in seeking solutions. Illustrative examples are solved, and it is shown that the proposed equation may produce more precise numerical solutions than the second evolution equation, and its computation time may be shorter for dense discretization.
CL Apr 9 (Code)
Tool Retrieval Bridge: Aligning Vague Instructions with Retriever Preferences via Bridge Model
Kunfeng Chen, Luyao Zhuang, Fei Liao et al.
Tool learning has emerged as a promising paradigm for large language models (LLMs) to address real-world challenges. Because the set of available tools is large and irregularly updated, tool retrieval for selecting the desired tool subset is essential. However, current tool retrieval methods are usually based on academic benchmarks containing overly detailed instructions (e.g., specific API names and parameters), while real-world instructions are more vague. Such a discrepancy hinders tool retrieval in real-world applications. In this paper, we first construct a new benchmark, VGToolBench, to simulate vague human instructions. Based on this, we conduct a series of preliminary analyses and find that vague instructions indeed damage the performance of tool retrieval. To this end, we propose a simple-yet-effective Tool Retrieval Bridge (TRB) approach to boost the performance of tool retrieval for vague instructions. The principle of TRB is to introduce a bridge model to rewrite the vague instructions into more specific ones and alleviate the gap between vague instructions and retriever preferences. We conduct extensive experiments under multiple commonly used retrieval settings, and the results show that TRB effectively mitigates the ambiguity of vague instructions while delivering consistent and substantial improvements across all baseline retrievers. For example, with the help of TRB, BM25 achieves a relative improvement of up to 111.51%, i.e., increasing the average NDCG score from 9.73 to 19.59. The source code and models are publicly available at https://github.com/kfchenhn/TRB.
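Since the results above are reported in NDCG, a short sketch of the standard metric may help. It uses the common linear-gain form with a log2 rank discount; the abstract does not state the exact variant or cutoff the authors use, so the function names and the sample relevance lists below are assumptions for illustration only.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)          # rank is 0-based, so the discount starts at log2(2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels of retrieved tools, in ranked order.
perfect_ranking = [1, 1, 0]
worse_ranking = [0, 1, 1]

print(ndcg(perfect_ranking, k=3), ndcg(worse_ranking, k=3))
```

A ranking that places all relevant tools first scores 1.0, and pushing relevant tools further down the list lowers the score.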
SY Feb 20, 2025
Compact Formulation of the First Evolution Equation for Optimal Control Computation
Sheng Zhang, Fei Liao, Wei-Qi Qian
The first evolution equation is derived under the Variation Evolving Method (VEM), which seeks optimal solutions with the variation evolution principle. To improve the performance, its compact form is developed. By replacing the variation evolution of the states and costates with that of the controls, the dimension-reduced Evolution Partial Differential Equation (EPDE) only solves the control variables along the variation time to get the optimal solution, and its definite conditions may be arbitrary. With this equation, the scale of the resulting Initial-value Problem (IVP), transformed via the semi-discrete method, is significantly reduced. Illustrative examples are solved, and it is shown that the compact-form evolution equation outperforms the primary form in precision, and its efficiency may be higher for dense discretization. Moreover, in discussing the connections to the classic iteration methods, it is uncovered that the computation scheme of the gradient method is the discrete implementation of the third evolution equation, and the compact form of the first evolution equation is a continuous realization of the Newton-type iteration mechanism.
IV Oct 10, 2020 (Code)
Selective Information Passing for MR/CT Image Segmentation
Qikui Zhu, Liang Li, Jiangnan Hao et al.
Automated medical image segmentation plays an important role in many clinical applications, but it is a very challenging task due to complex background texture, the lack of clear boundaries, and significant shape and texture variation between images. Many researchers have proposed encoder-decoder architectures with skip connections that combine low-level feature maps from the encoder path with high-level feature maps from the decoder path for automatically segmenting medical images. The skip connections have been shown to be effective in recovering fine-grained details of the target objects and may facilitate gradient back-propagation. However, not all the feature maps transmitted by those connections contribute positively to the network performance. In this paper, to adaptively select useful information to pass through those skip connections, we propose a novel 3D network with self-supervised function, named selective information passing network (SIP-Net). We evaluate our proposed model on the MICCAI Prostate MR Image Segmentation 2012 Grand Challenge dataset, TCIA Pancreas CT-82 and MICCAI 2017 Liver Tumor Segmentation (LiTS) Challenge dataset. The experimental results across these datasets show that our model achieved improved segmentation results and outperformed other state-of-the-art methods. The source code of this work is available at https://github.com/ahukui/SIPNet.
RO Dec 18, 2024
RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation
Kun Wu, Chengkai Hou, Jiaming Liu et al.
In this paper, we introduce RoboMIND (Multi-embodiment Intelligence Normative Data for Robot Manipulation), a dataset containing 107k demonstration trajectories across 479 diverse tasks involving 96 object classes. RoboMIND is collected through human teleoperation and encompasses comprehensive robotic-related information, including multi-view observations, proprioceptive robot state information, and linguistic task descriptions. To ensure data consistency and reliability for imitation learning, RoboMIND is built on a unified data collection platform and a standardized protocol, covering four distinct robotic embodiments: the Franka Emika Panda, the UR5e, the AgileX dual-arm robot, and a humanoid robot with dual dexterous hands. Our dataset also includes 5k real-world failure demonstrations, each accompanied by detailed causes, enabling failure reflection and correction during policy learning. Additionally, we created a digital twin environment in the Isaac Sim simulator, replicating the real-world tasks and assets, which facilitates the low-cost collection of additional training data and enables efficient evaluation. To demonstrate the quality and diversity of our dataset, we conducted extensive experiments using various imitation learning methods for single-task settings and state-of-the-art Vision-Language-Action (VLA) models for multi-task scenarios. By leveraging RoboMIND, the VLA models achieved high manipulation success rates and demonstrated strong generalization capabilities. To the best of our knowledge, RoboMIND is the largest multi-embodiment teleoperation dataset collected on a unified platform, providing large-scale and high-quality robotic training data. Our project is at https://x-humanoid-robomind.github.io/.
CL May 28, 2025
Resolving Knowledge Conflicts in Domain-specific Data Selection: A Case Study on Medical Instruction-tuning
Qihuang Zhong, Liang Ding, Fei Liao et al.
Domain-specific instruction-tuning has become the de facto standard for improving the performance of large language models (LLMs) in specialized applications, e.g., medical question answering. Since the instruction-tuning dataset might contain redundant or low-quality data, data selection (DS) is usually required to maximize the data efficiency. Despite the successes in the general domain, current DS methods often struggle to select the desired data for domain-specific instruction-tuning. One of the main reasons is that they neglect the impact of knowledge conflicts, i.e., the discrepancy between LLMs' pretrained knowledge and context knowledge of instruction data, which could damage LLMs' prior abilities and lead to hallucination. To this end, we propose a simple-yet-effective Knowledge-aware Data Selection (KDS) framework to select the domain-specific instruction-tuning data that meets LLMs' actual needs. The core of KDS is to leverage two knowledge-aware metrics for quantitatively measuring knowledge conflicts from two aspects: context-memory knowledge alignment and intra-memory knowledge consistency. By filtering the data with large knowledge conflicts and sampling the high-quality and diverse data, KDS can effectively stimulate the LLMs' abilities and achieve better domain-specific performance. Taking the medical domain as the testbed, we conduct extensive experiments and empirically show that KDS surpasses the other baselines and brings significant and consistent performance gains across all LLMs. More encouragingly, KDS effectively improves the model generalization and alleviates the hallucination problem.