77.1 LG Jun 1
HMPO: Hybrid Median-length Policy Optimization for Chain-of-Thought Compression
Minghui Zheng, Hongxu Chen, Huimin Ren et al.
Large language models achieve remarkable performance via extended chain-of-thought (CoT) reasoning, yet this lengthy process incurs substantial inference overhead. Existing CoT compression methods struggle with inflexible manual length budgets, computationally expensive multi-stage training pipelines, and fragile scalability restricted to small models. We propose HMPO (Hybrid Median-length Policy Optimization), a cost-effective, single-stage reinforcement learning framework. HMPO efficiently compresses CoT via three synergistic components: an adaptive median-based budget derived from successful rollouts to eliminate manual tuning, a cosine-decay token reward for smooth length penalization, and a multiplicative reward formulation that substantially mitigates trivial reward hacking by strictly prioritizing answer correctness. Trained exclusively on mathematical data, HMPO generalizes seamlessly across math, code, science, and instruction-following tasks. Extensive experiments scaling from 9B to 122B parameters across dense and Mixture-of-Experts (MoE) architectures demonstrate that HMPO achieves 19% to 46% token compression with negligible accuracy degradation, all while drastically reducing training costs compared to existing multi-stage baselines.
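The abstract names the three reward components but gives no formulas, so the following is only a minimal sketch of how they could fit together. The function names, the `max_len` cap, the `floor` parameter, and the exact shape of the cosine decay are all assumptions made for illustration, not the authors' definitions.

```python
import math
from statistics import median


def median_budget(lengths, correct):
    """Median token length over the correct rollouts of one prompt (None if none are correct)."""
    ok = [n for n, c in zip(lengths, correct) if c]
    return median(ok) if ok else None


def cosine_length_reward(length, budget, max_len, floor=0.0):
    """Assumed shape: 1.0 up to the budget, then a cosine decay down to `floor` at max_len."""
    if length <= budget:
        return 1.0
    if length >= max_len:
        return floor
    progress = (length - budget) / (max_len - budget)
    return floor + (1.0 - floor) * 0.5 * (1.0 + math.cos(math.pi * progress))


def hybrid_rewards(lengths, correct, max_len):
    """Multiplicative combination: an incorrect answer gets 0 regardless of its length."""
    budget = median_budget(lengths, correct)
    rewards = []
    for n, c in zip(lengths, correct):
        if not c:
            rewards.append(0.0)  # correctness gates the length term
        else:
            rewards.append(cosine_length_reward(n, budget, max_len))
    return rewards


# Hypothetical group of four rollouts for a single prompt.
print(hybrid_rewards([100, 200, 300, 400], [True, True, True, False], max_len=500))
```

Because the length term is multiplied by correctness rather than added to it, a short but wrong rollout cannot earn reward in this sketch, which is one simple way to express the "strictly prioritizing answer correctness" idea.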
RO Jul 30, 2023
TransFusion: A Practical and Effective Transformer-based Diffusion Model for 3D Human Motion Prediction
Sibo Tian, Minghui Zheng, Xiao Liang
Predicting human motion plays a crucial role in ensuring a safe and effective human-robot close collaboration in intelligent remanufacturing systems of the future. Existing works can be categorized into two groups: those focusing on accuracy, predicting a single future motion, and those generating diverse predictions based on observations. The former group fails to address the uncertainty and multi-modal nature of human motion, while the latter group often produces motion sequences that deviate too far from the ground truth or become unrealistic within historical contexts. To tackle these issues, we propose TransFusion, an innovative and practical diffusion-based model for 3D human motion prediction which can generate samples that are more likely to happen while maintaining a certain level of diversity. Our model leverages Transformer as the backbone with long skip connections between shallow and deep layers. Additionally, we employ the discrete cosine transform to model motion sequences in the frequency space, thereby improving performance. In contrast to prior diffusion-based models that utilize extra modules like cross-attention and adaptive layer normalization to condition the prediction on past observed motion, we treat all inputs, including conditions, as tokens to create a more lightweight model compared to existing approaches. Extensive experimental studies are conducted on benchmark datasets to validate the effectiveness of our human motion prediction model.
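To make the frequency-space idea concrete, here is a small sketch of representing one joint coordinate's trajectory with discrete cosine transform coefficients. The abstract does not state which DCT variant or how many coefficients are used; the orthonormal DCT-II and the toy trajectory below are assumptions for illustration only.

```python
import math


def dct_basis(n_frames):
    """Rows of the orthonormal DCT-II matrix for sequences of length n_frames."""
    basis = []
    for k in range(n_frames):
        scale = math.sqrt(1.0 / n_frames) if k == 0 else math.sqrt(2.0 / n_frames)
        basis.append([scale * math.cos(math.pi * (2 * t + 1) * k / (2 * n_frames))
                      for t in range(n_frames)])
    return basis


def dct(signal, n_coeffs=None):
    """Project a time series onto the first n_coeffs cosine basis vectors."""
    basis = dct_basis(len(signal))
    if n_coeffs is None:
        n_coeffs = len(signal)
    return [sum(b * x for b, x in zip(row, signal)) for row in basis[:n_coeffs]]


def idct(coeffs, n_frames):
    """Rebuild a length-n_frames series from (possibly truncated) coefficients."""
    basis = dct_basis(n_frames)
    return [sum(c * basis[k][t] for k, c in enumerate(coeffs)) for t in range(n_frames)]


# Hypothetical trajectory of a single joint coordinate over six frames.
trajectory = [0.0, 0.1, 0.4, 0.9, 1.6, 2.5]
low_freq = dct(trajectory, n_coeffs=3)           # keep only the smoothest components
smoothed = idct(low_freq, len(trajectory))       # approximate reconstruction
exact = idct(dct(trajectory), len(trajectory))   # all coefficients: lossless up to rounding
```

Keeping only the leading coefficients gives a compact, smooth description of the motion, which is the usual motivation for working in DCT space.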
14.7 RO May 15 Code
Constrained MPC-Based Motion Planning for Morphing Quadrotors in Ultra-Narrow Passages under Limited Perception
Harsh Modi, Xiao Liang, Minghui Zheng
This paper introduces a motion planning framework that plans morphology and trajectory for morphing quadrotors in extremely constrained environments. We develop a novel obstacle avoidance cost function for nonlinear model predictive control (MPC) that enables navigation through extremely narrow gaps under limited perception from a 2D LiDAR. Classical artificial potential field-based costs typically have a high cost in narrow passages, artificially blocking the navigable path. In contrast, we propose a smooth exponential obstacle cost that preserves low traversal cost within narrow gaps while maintaining strong collision avoidance behavior. The formulation avoids hard activation thresholds and introduces a cost reduction factor to reduce the cost within narrow passages. Direct use of 2D LiDAR measurements in MPC allows navigation around arbitrarily shaped obstacles. The method is embedded within an acados-based nonlinear MPC framework. Simulation and experimental results demonstrate successful traversal of narrow corridors where typical repulsive cost functions would fail. The approach provides a computationally efficient and practical solution for navigating through tight spaces while maintaining safety from obstacles. Although we implement the framework on morphing quadrotors, the cost function formulation is general-purpose and applies to any mobile robot. The implementation code is available at https://github.com/harshjmodi1996/morphocopter_mpc and a short video is available at https://zh.engr.tamu.edu/wp-content/uploads/sites/310/2026/03/MPC_MorphoCopter_video.mp4.
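The contrast between a potential-field repulsion and a smooth exponential cost can be illustrated numerically. The abstract does not give the paper's cost function or its cost reduction factor, so the sketch below uses a textbook repulsive potential and a generic exponential penalty; all gains, radii, and LiDAR ranges are hypothetical values chosen for illustration.

```python
import math


def apf_repulsive_cost(distance, influence_radius=1.0, gain=1.0):
    """Textbook potential-field repulsion: zero beyond the influence radius,
    growing without bound as distance -> 0 (distance must be > 0)."""
    if distance >= influence_radius:
        return 0.0  # hard activation threshold
    return 0.5 * gain * (1.0 / distance - 1.0 / influence_radius) ** 2


def exponential_cost(distance, weight=1.0, decay=5.0):
    """Generic smooth exponential penalty: no threshold, bounded by `weight` at contact."""
    return weight * math.exp(-decay * distance)


def total_cost(ranges, cost_fn):
    """Sum a per-beam cost over LiDAR range readings (in metres)."""
    return sum(cost_fn(r) for r in ranges)


# Hypothetical readings at the centre of a gap with a wall 5 cm away on each side.
gap_ranges = [0.05, 0.05]
print(total_cost(gap_ranges, apf_repulsive_cost))  # very large: the gap looks blocked
print(total_cost(gap_ranges, exponential_cost))    # stays bounded
```

In this toy comparison the potential-field term dominates any tracking objective inside the gap, while the exponential term stays on the same scale as its weight.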
78.7 RO Mar 10
SELF-VLA: A Skill Enhanced Agentic Vision-Language-Action Framework for Contact-Rich Disassembly
Chang Liu, Sibo Tian, Xiao Liang et al.
Disassembly automation has long been pursued to address the growing demand for efficient and proper recovery of valuable components from end-of-life (EoL) electronic products. Existing approaches have demonstrated promising and regimented performance by decomposing the disassembly process into different subtasks. However, each subtask typically requires extensive data preparation, model training, and system management. Moreover, these approaches are often task- and component-specific, making them poorly suited to handle the variability and uncertainty of EoL products and limiting their generalization capabilities. All these factors restrict the practical deployment of current robotic disassembly systems and leave them highly reliant on human labor. With the recent development of foundation models in robotics, vision-language-action (VLA) models have shown impressive performance on standard robotic manipulation tasks, but their applicability to complex, contact-rich, and long-horizon industrial practices like disassembly, which requires sequential and precise manipulation, remains limited. To address this challenge, we propose SELF-VLA, an agentic VLA framework that integrates explicit disassembly skills. Experimental studies demonstrate that our framework significantly outperforms current state-of-the-art end-to-end VLA models on two contact-rich disassembly tasks. A video illustration is available at https://zh.engr.tamu.edu/wp-content/uploads/sites/310/2026/03/IROS-VLA-Video.mp4.
60.7 RO Mar 25
Toward Generalist Neural Motion Planners for Robotic Manipulators: Challenges and Opportunities
Davood Soleymanzadeh, Ivan Lopez-Sanchez, Hao Su et al.
State-of-the-art generalist manipulation policies have enabled the deployment of robotic manipulators in unstructured human environments. However, these frameworks struggle in cluttered environments primarily because they utilize auxiliary modules for low-level motion planning and control. Motion planning remains challenging due to the high dimensionality of the robot's configuration space and the presence of workspace obstacles. Neural motion planners have enhanced motion planning efficiency by offering fast inference and effectively handling the inherent multi-modality of the motion planning problem. Despite such benefits, current neural motion planners often struggle to generalize to unseen, out-of-distribution planning settings. This paper reviews and analyzes the state-of-the-art neural motion planners, highlighting both their benefits and limitations. It also outlines a path toward establishing generalist neural motion planners capable of handling domain-specific challenges. For a list of the reviewed papers, please refer to https://davoodsz.github.io/planning-manip-survey.github.io/.
CV Sep 19, 2024
Bayesian-Optimized One-Step Diffusion Model with Knowledge Distillation for Real-Time 3D Human Motion Prediction
Sibo Tian, Minghui Zheng, Xiao Liang
Human motion prediction is a cornerstone of human-robot collaboration (HRC), as robots need to infer the future movements of human workers based on past motion cues to proactively plan their motion, ensuring safety in close collaboration scenarios. The diffusion model has demonstrated remarkable performance in predicting high-quality motion samples with reasonable diversity, but suffers from a slow generative process which necessitates multiple model evaluations, hindering real-world applications. To enable real-time prediction, in this work, we propose training a one-step multi-layer perceptron-based (MLP-based) diffusion model for motion prediction using knowledge distillation and Bayesian optimization. Our method contains two steps. First, we distill a pretrained diffusion-based motion predictor, TransFusion, directly into a one-step diffusion model with the same denoiser architecture. Then, to further reduce the inference time, we remove the computationally expensive components from the original denoiser and use knowledge distillation once again to distill the obtained one-step diffusion model into an even smaller model based solely on MLPs. Bayesian optimization is used to tune the hyperparameters for training the smaller diffusion model. Extensive experimental studies are conducted on benchmark datasets, and our model can significantly improve the inference speed, achieving real-time prediction without noticeable degradation in performance.
81.5 CV Mar 13
Egocentric World Model for Photorealistic Hand-Object Interaction Synthesis
Dayou Li, Lulin Liu, Bangya Liu et al.
To serve as a scalable data source for embodied AI, world models should act as true simulators that infer interaction dynamics strictly from user actions, rather than mere conditional video generators relying on privileged future object states. In this context, egocentric Human-Object Interaction (HOI) world models are critical for predicting physically grounded first-person rollouts. However, building such models is profoundly challenging due to rapid head motions, severe occlusions, and high-DoF hand articulations that abruptly alter contact topologies. Consequently, existing approaches often circumvent these physics challenges by resorting to conditional video generation with access to known future object trajectories. We introduce EgoHOI, an egocentric HOI world model that breaks away from this shortcut to simulate photorealistic, contact-consistent interactions from action signals alone. To ensure physical accuracy without future-state inputs, EgoHOI distills geometric and kinematic priors from 3D estimates into physics-informed embeddings. These embeddings regularize the egocentric rollouts toward physically valid dynamics. Experiments on the HOT3D dataset demonstrate consistent gains over strong baselines, and ablations validate the effectiveness of our physics-informed design.
97.3 RO Mar 13
Learning Actionable Manipulation Recovery via Counterfactual Failure Synthesis
Dayou Li, Jiuzhou Lei, Hao Wang et al.
While recent foundation models have significantly advanced robotic manipulation, these systems still struggle to autonomously recover from execution errors. Current failure-learning paradigms rely on either costly and unsafe real-world data collection or simulator-based perturbations, which introduce a severe sim-to-real gap. Furthermore, existing visual analyzers predominantly output coarse, binary diagnoses rather than the executable, trajectory-level corrections required for actual recovery. To bridge the gap between failure diagnosis and actionable recovery, we introduce Dream2Fix, a framework that synthesizes photorealistic, counterfactual failure rollouts directly from successful real-world demonstrations. By perturbing actions within a generative world model, Dream2Fix creates paired failure-correction data without relying on simulators. To ensure the generated data is physically viable for robot learning, we implement a structured verification mechanism that strictly filters rollouts for task validity, visual coherence, and kinematic safety. This engine produces a high-fidelity dataset of over 120k paired samples. Using this dataset, we fine-tune a vision-language model to jointly predict failure types and precise recovery trajectories, mapping visual anomalies directly to corrective actions. Extensive real-world robotic experiments show our approach achieves state-of-the-art correction accuracy, improving from 19.7% to 81.3% over prior baselines, and successfully enables zero-shot closed-loop failure recovery in physical deployments.
68.2 RO Mar 10
TATIC: Task-Aware Temporal Learning for Human Intent Inference from Physical Corrections in Human-Robot Collaboration
Jiurun Song, Xiao Liang, Minghui Zheng
In human-robot collaboration (HRC), robots must adapt online to dynamic task constraints and evolving human intent. While physical corrections provide a natural, low-latency channel for operators to convey motion-level adjustments, extracting task-level semantic intent from such brief interactions remains challenging. Existing foundation-model-based approaches primarily rely on vision and language inputs and lack mechanisms to interpret physical feedback. Meanwhile, traditional physical human-robot interaction (pHRI) methods leverage physical corrections for trajectory guidance but struggle to infer task-level semantics. To bridge this gap, we propose TATIC, a unified framework that utilizes torque-based contact force estimation and a task-aware Temporal Convolutional Network (TCN) to jointly infer discrete task-level intent and estimate continuous motion-level parameters from brief physical corrections. Task-aligned feature canonicalization ensures robust generalization across diverse layouts, while an intent-driven adaptation scheme translates inferred human intent into robot motion adaptations. Experiments achieve a 0.904 Macro-F1 score in intent recognition and demonstrate successful hardware validation in collaborative disassembly (see experimental video at https://youtu.be/xF8A52qwEc8).
26.9 RO Mar 10
DRAFTO: Decoupled Reduced-space and Adaptive Feasibility-repair Trajectory Optimization for Robotic Manipulators
Yichang Feng, Xiao Liang, Minghui Zheng
This paper introduces a new algorithm for trajectory optimization, Decoupled Reduced-space and Adaptive Feasibility-repair Trajectory Optimization (DRAFTO). It first constructs a constrained objective that accounts for smoothness, safety, joint limits, and task requirements. Then, it optimizes the coefficients, which are the coordinates of a set of basis functions for trajectory parameterization. To reduce the number of repeated constrained optimizations while handling joint-limit feasibility, the optimization is decoupled into a reduced-space Gauss-Newton (GN) descent for the main iterations and constrained quadratic programming for initialization and terminal feasibility repair. The two-phase acceptance rule with a non-monotone policy is applied to the GN model, which uses a hinge-squared penalty for inequality constraints, to ensure globalizability. The results of our benchmark tests against optimization-based planners, such as CHOMP, TrajOpt, GPMP2, and FACTO, and sampling-based planners, such as RRT-Connect, RRT*, and PRM, validate the high efficiency and reliability across diverse scenarios and tasks. The experiment involving grabbing an object from a drawer further demonstrates the potential for implementation in complex manipulation tasks. The supplemental video is available at https://youtu.be/XisFI37YyTQ.
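A hinge-squared penalty turns an inequality constraint g(q) <= 0 into the smooth cost max(0, g(q))^2, which is zero when the constraint holds and grows quadratically with the violation. The sketch below applies that idea to joint limits; the function names, the weight, and the sample limits are illustrative assumptions and are not taken from DRAFTO itself.

```python
def hinge_squared(violation):
    """max(0, g)^2: zero when the constraint g <= 0 holds, quadratic otherwise."""
    return max(0.0, violation) ** 2


def joint_limit_penalty(q, q_min, q_max, weight=1.0):
    """Sum of hinge-squared penalties for lower and upper joint-limit violations."""
    total = 0.0
    for qi, lo, hi in zip(q, q_min, q_max):
        total += hinge_squared(lo - qi)  # lower limit: lo - qi <= 0
        total += hinge_squared(qi - hi)  # upper limit: qi - hi <= 0
    return weight * total


# Hypothetical 3-joint configuration with limits of [-1, 1] rad on every joint.
q = [0.0, 1.5, -2.0]
print(joint_limit_penalty(q, [-1.0] * 3, [1.0] * 3))  # 0.5**2 + 1.0**2 = 1.25
```

Since the penalty and its first derivative are both continuous at the constraint boundary, it suits Gauss-Newton style least-squares updates, where each active constraint contributes a residual max(0, g).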
53.5 RO Apr 1
Learning When to See and When to Feel: Adaptive Vision-Torque Fusion for Contact-Aware Manipulation
Jiuzhou Lei, Chang Liu, Yu She et al.
Vision-based policies have achieved good performance in robotic manipulation due to the accessibility and richness of visual observations. However, purely visual sensing becomes insufficient in contact-rich and force-sensitive tasks where force/torque (F/T) signals provide critical information about contact dynamics, alignment, and interaction quality. Although various strategies have been proposed to integrate vision and F/T signals, including auxiliary prediction objectives, mixture-of-experts architectures, and contact-aware gating mechanisms, a comparison of these approaches is still lacking. In this work, we provide a comparative study of different F/T-vision integration strategies within diffusion-based manipulation policies. In addition, we propose an adaptive integration strategy that ignores F/T signals during non-contact phases while adaptively leveraging both vision and torque information during contact. Experimental results demonstrate that our method outperforms the strongest baseline by 14% in success rate, highlighting the importance of contact-aware multimodal fusion for robotic manipulation.
35.0 CV Mar 28
Evaluating Large and Lightweight Vision Models for Irregular Component Segmentation in E-Waste Disassembly
Xinyao Zhang, Chang Liu, Xiao Liang et al.
Precise segmentation of irregular and densely arranged components is essential for robotic disassembly and material recovery in electronic waste (e-waste) recycling. This study evaluates the impact of model architecture and scale on segmentation performance by comparing SAM2, a transformer-based vision model, with the lightweight YOLOv8 network. Both models were trained and tested on a newly collected dataset of 1,456 annotated RGB images of laptop components including logic boards, heat sinks, and fans, captured under varying illumination and orientation conditions. Data augmentation techniques, such as random rotation, flipping, and cropping, were applied to improve model robustness. YOLOv8 achieved higher segmentation accuracy (mAP50 = 98.8%, mAP50-95 = 85%) and stronger boundary precision than SAM2 (mAP50 = 8.4%). SAM2 demonstrated flexibility in representing diverse object structures but often produced overlapping masks and inconsistent contours. These findings show that large pre-trained models require task-specific optimization for industrial applications. The resulting dataset and benchmarking framework provide a foundation for developing scalable vision algorithms for robotic e-waste disassembly and circular manufacturing systems.
SE Feb 26, 2024
LangGPT: Rethinking Structured Reusable Prompt Design Framework for LLMs from the Programming Language
Ming Wang, Yuanzhong Liu, Xiaoyu Liang et al.
LLMs have demonstrated commendable performance across diverse domains. Nevertheless, formulating high-quality prompts to instruct LLMs proficiently poses a challenge for non-AI experts. Existing research in prompt engineering suggests somewhat scattered optimization principles and designs empirically dependent prompt optimizers. Unfortunately, these endeavors lack a structured design template, incurring high learning costs and resulting in low reusability. In addition, it is not conducive to the iterative updating of prompts. Inspired by structured reusable programming languages, we propose LangGPT, a dual-layer prompt design framework as the programming language for LLMs. LangGPT has an easy-to-learn normative structure and provides an extended structure for migration and reuse. Experiments illustrate that LangGPT significantly enhances the performance of LLMs. Moreover, the case study shows that LangGPT leads LLMs to generate higher-quality responses. Furthermore, we analyzed the ease of use and reusability of LangGPT through a user survey in our online community.
33.1 RO Apr 8
Flow Motion Policy: Manipulator Motion Planning with Flow Matching Models
Davood Soleymanzadeh, Xiao Liang, Minghui Zheng
Open-loop end-to-end neural motion planners have recently been proposed to improve motion planning for robotic manipulators. These methods enable planning directly from sensor observations without relying on a privileged collision checker during planning. However, many existing methods generate only a single path for a given workspace across different runs, and do not leverage their open-loop structure for inference-time optimization. To address this limitation, we introduce Flow Motion Policy, an open-loop, end-to-end neural motion planner for robotic manipulators that leverages the stochastic generative formulation of flow matching methods to capture the inherent multi-modality of planning datasets. By modeling a distribution over feasible paths, Flow Motion Policy enables efficient inference-time best-of-$N$ sampling. The method generates multiple end-to-end candidate paths, evaluates their collision status after planning, and executes the first collision-free solution. We benchmark the Flow Motion Policy against representative sampling-based and neural motion planning methods. Evaluation results demonstrate that Flow Motion Policy improves planning success and efficiency, highlighting the effectiveness of stochastic generative policies for end-to-end motion planning and inference-time optimization. Experimental evaluation videos are available at https://zh.engr.tamu.edu/wp-content/uploads/sites/310/2026/03/FMP-Website.mp4.
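The sample-then-check loop described above can be sketched in a few lines. The sampler and collision checker here are hypothetical stand-ins (a fixed list of 2D candidate paths and a waypoint-only test against a disc obstacle); the actual policy, robot model, and collision checker are not specified in the abstract.

```python
import math


def first_collision_free(sample_path, is_collision_free, n_samples):
    """Draw up to n_samples candidate paths and return the first collision-free one
    together with the number of samples drawn; (None, n_samples) if all collide."""
    for attempt in range(1, n_samples + 1):
        path = sample_path()
        if is_collision_free(path):
            return path, attempt
    return None, n_samples


def make_disc_checker(center, radius):
    """Toy checker: a path is free if no waypoint lies inside the disc.
    (A real checker would also test the segments between waypoints.)"""
    def is_free(path):
        return all(math.dist(p, center) > radius for p in path)
    return is_free


# Hypothetical candidates standing in for samples from a stochastic generative policy.
candidates = iter([
    [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)],  # passes through the obstacle
    [(0.0, 0.0), (0.5, 0.9), (1.0, 1.0)],  # detours around it
])
checker = make_disc_checker(center=(0.5, 0.5), radius=0.2)
path, tries = first_collision_free(lambda: next(candidates), checker, n_samples=2)
print(path, tries)
```

Collision checking happens only after a complete path has been generated, so the generator itself never needs access to the checker, matching the open-loop structure the abstract describes.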
81.2 SY Apr 3
Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems
Sibo Tian, Xiao Liang, Sara Behdad et al.
Remanufacturing is fundamentally more challenging than traditional manufacturing due to the significant uncertainty, variability, and incompleteness inherent in end-of-life (EoL) products. At the same time, it has become increasingly essential and urgent for facilitating a circular economy, driven by the growing volume of discarded electronic products and the escalating scarcity of critical materials. In this paper, we review the existing literature and examine the key challenges as well as emerging opportunities in intelligent automation for EoL electronics remanufacturing, providing a comprehensive overview of how robotics, control, and artificial intelligence (AI) can jointly enable scalable, safe, and intelligent remanufacturing systems. This paper starts with the definition, scope, and motivation of remanufacturing within the context of a circular economy, highlighting its societal and environmental significance. Then it delves into intelligent automation approaches for disassembly, inspection, sorting, and component reprocessing in this domain, covering advanced methods for multimodal perception, decision-making under uncertainty, flexible planning algorithms, and force-aware manipulation. The paper further reviews several emerging techniques, including large foundation models, human-in-the-loop integration, and digital twins that have the potential to support future research in this area. By integrating these topics, we aim to illustrate how next-generation remanufacturing systems can achieve robust, adaptable, and efficient operation in the face of complex real-world challenges.
RO Jul 6, 2020
A Real-Time Receding Horizon Sequence Planner for Disassembly in A Human-Robot Collaboration Setting
Meng-Lun Lee, Sara Behdad, Xiao Liang et al.
Product disassembly is a labor-intensive process and is far from being automated. Typically, automated disassembly is not robust enough to handle product variety in shape and model, or physical uncertainties arising from component imperfections, damage during component usage, or insufficient product information. To overcome these difficulties and to automate the disassembly procedure through human-robot collaboration without excessive computational cost, this paper proposes a real-time receding horizon sequence planner that distributes tasks between the robot and the human operator while taking real-time human motion into consideration. The sequence planner aims to address several issues in the disassembly line, such as varying orientations, safety constraints of human operators, uncertainty of human operation, and the computational cost of a large number of disassembly tasks. The proposed disassembly sequence planner identifies both the positions and orientations of the to-be-disassembled items, as well as the location of the human operator, and obtains an optimal disassembly sequence that follows disassembly rules and safety constraints for human operation. Experimental tests have been conducted to validate the proposed planner: the robot locates and disassembles the components following the optimal sequence, explicitly considers the human operator's real-time motion, and collaborates with the human operator without violating safety constraints.
RO Jul 6, 2020
Including Image-based Perception in Disturbance Observer for Warehouse Drones
Zhu Chen, Xiao Liang, Minghui Zheng
Grasping and releasing objects causes oscillations in warehouse delivery drones. To reduce such undesired oscillations, this paper treats the to-be-delivered object as an unknown external disturbance and presents an image-based disturbance observer (DOB) to estimate and reject such disturbance. Unlike existing DOB techniques, which can only compensate for the disturbance after the oscillations occur, the proposed image-based DOB incorporates image-based disturbance prediction into the control loop to further improve performance. The proposed image-based DOB consists of two parts. The first is deep-learning-based disturbance prediction: from an image of the to-be-delivered object, a sequential disturbance signal is predicted in advance using a pre-trained convolutional neural network (CNN) connected to a long short-term memory (LSTM) network. The second part is a conventional DOB in the feedback loop with a feedforward correction, which utilizes the deep learning prediction to generate a learning signal. Numerical studies are performed to validate the proposed image-based DOB regarding oscillation reduction for delivery drones during the grasping and releasing of objects.
CV Mar 2, 2020
Vehicle-Human Interactive Behaviors in Emergency: Data Extraction from Traffic Accident Videos
Wansong Liu, Danyang Luo, Changxu Wu et al.
Studying vehicle-human interactive behavior in emergencies currently requires large amounts of data from actual emergency situations, which are almost unavailable. Existing public data sources on autonomous vehicles (AVs) mainly focus either on normal driving scenarios or on emergency situations without human involvement. To fill this gap and facilitate related research, this paper provides a new yet convenient way to extract interactive behavior data (i.e., the trajectories of vehicles and humans) from actual accident videos captured by both surveillance cameras and driving recorders. The main challenge for data extraction from real-time accident video lies in the fact that the recording cameras are uncalibrated and the angles of surveillance are unknown. The approach proposed in this paper employs image processing to obtain a new perspective that differs from the original video's perspective. Meanwhile, we manually detect and mark object feature points in each image frame. To acquire a gradient of reference ratios, a geometric model is implemented in the analysis of reference pixel values, and the feature points are then scaled to the object trajectory based on the gradient of ratios. The generated trajectories not only restore the object movements completely but also reflect changes in vehicle velocity and rotation based on the feature point distributions.