RO May 27
Simultaneous Contact Selection and Planning for Contact-Rich Manipulation with Cascaded Optimization
Zhe Zhang, Xingrong Diao, Haoxiang Liang et al.
We propose an optimization-based framework for robust contact-rich manipulation. Recent contact-implicit methods enable online hybrid planning across contact modes, allowing closed-loop manipulation for a given target state and contact location sequence of the robot and object. However, most existing approaches lack the ability to autonomously reason about and generate diverse contact location sequences and manipulation trajectories, i.e., active contact location selection, which limits their applicability to relatively simple tasks. Active contact location selection is challenging due to complementarity in contact dynamics and sparse gradients, making the design of a unified framework for contact selection and planning difficult. To address these challenges, we introduce Simultaneous Contact Selection and Planning (SCSP), a cascaded optimization framework comprising Contact Selection Optimization (CSO) and Contact Planning Optimization (CPO). CSO leverages a surrogate contact model and discrete-continuous optimization to efficiently resolve the nonsmoothness and coupling in contact selection, enabling online global search for optimal contact locations. CPO performs prior-guided contact planning by evaluating the reference contact locations produced by CSO and generating corresponding manipulation trajectories in real time for redundant manipulators. Extensive simulations and real-world experiments demonstrate that SCSP produces diverse manipulation behaviors and robust control under inaccurate dynamics and perceptual noise. We further validate the generalization of the framework on challenging manipulation tasks. Project website: https://sites.google.com/view/scsp-robot.
RO May 27
A Digital Twin Framework for Virtual Visuo-Haptic Teleoperation of Complex-Shaped Optical Microrobots
Zongcai Tan, Lan Wei, Dandan Zhang
Optical tweezers (OT) provide piconewton-scale manipulation for delicate biomedical tasks, where visuo-haptic feedback can improve operator awareness by conveying interaction-force cues and trap-stability information. However, visuo-haptic teleoperation frameworks for complex-shaped optical microrobots remain underdeveloped, particularly in multi-trap manipulation scenarios. This paper presents a digital twin framework for virtual visuo-haptic teleoperation of complex-shaped OT-driven microrobots. The framework integrates a digital twin environment, image-based pose and depth estimation, microrobot motion simulation, and model-based haptic rendering within a Robot Operating System (ROS)-connected bimanual teleoperation system. For force modeling, we combine a Multi-Sphere Distributed Manipulation (MSDM) model with optical-force estimation from the Optical Tweezers Toolbox, enabling simulator-driven visuo-haptic feedback. The framework reproduces representative microrobot motion trends and provides haptic force rendering that is numerically consistent with the fitted optical-force model. In simulated cell-delivery tasks, haptic feedback reduced the standard deviations of the contact-force metric and the microrobot-to-trap-center distance metric by 53.2% and 55.2%, respectively, and improved task success from 30% to 80%. These results demonstrate the framework's effectiveness for evaluating visuo-haptic teleoperation strategies for complex-shaped optical microrobots.
RO May 27
Mag-VLA: Vision-Language-Action Model for Bimanual Magnetically Actuated Microrobot Manipulation
Yongchen Wang, Kangyi Lu, Lan Wei et al.
Magnetically actuated microrobots have been used as wireless, non-contact manipulation tools at microscales, making them promising for minimally invasive applications. However, their control remains challenging due to indirect actuation, limited sensing, and nonlinear magnetic interactions. In this work, we propose Mag-VLA, a vision-language-action (VLA) model for dexterous magnetic microrobot manipulation using two robotic arms with mounted magnets for dynamic magnetic-field construction. Bimanual coordination enables capabilities such as microrobot reorientation that are difficult or infeasible with a single arm, but it also introduces coupled control challenges, as the policy must generate coordinated trajectories for both actuators within a shared workspace. Our framework adapts a Qwen2.5-VL-7B backbone using Low-Rank Adaptation (LoRA) to process visual observations and language instructions for action prediction. To capture task progression, we introduce a motion-aware phase classifier and a phase-conditioned Action Chunking Transformer (ACT) decoder for temporally coherent multi-step control. We further construct a teleoperated magnetic microrobot manipulation dataset covering three task configurations. Ablation studies show that the ACT-based decoder substantially outperforms alternative generative action heads. In real-robot experiments, Mag-VLA achieves a 90% approach success rate across all tasks and transport success rates of 80%, 70%, and 50% as task difficulty increases. These results demonstrate that hierarchical VLA modeling provides a promising framework for magnetic microrobot manipulation.
RO Jul 26, 2023
Attention for Robot Touch: Tactile Saliency Prediction for Robust Sim-to-Real Tactile Control
Yijiong Lin, Mauro Comi, Alex Church et al.
High-resolution tactile sensing can provide accurate information about local contact in contact-rich robotic tasks. However, the deployment of such tasks in unstructured environments remains under-investigated. To improve the robustness of tactile robot control in unstructured environments, we propose and study a new concept: tactile saliency for robot touch, inspired by the human touch attention mechanism from neuroscience and the visual saliency prediction problem from computer vision. In analogy to visual saliency, this concept involves identifying key information in tactile images captured by a tactile sensor. While visual saliency datasets are commonly annotated by humans, manually labelling tactile images is challenging due to their counterintuitive patterns. To address this challenge, we propose a novel approach comprising three interrelated networks: 1) a Contact Depth Network (ConDepNet), which generates a contact depth map to localize deformation in a real tactile image that contains target and noise features; 2) a Tactile Saliency Network (TacSalNet), which predicts a tactile saliency map to describe the target areas for an input contact depth map; and 3) a Tactile Noise Generator (TacNGen), which generates noise features to train the TacSalNet. Experimental results in contact pose estimation and edge-following in the presence of distractors showcase the accurate prediction of target features from real tactile images. Overall, our tactile saliency prediction approach gives robust sim-to-real tactile control in environments with unknown distractors. Project page: https://sites.google.com/view/tactile-saliency/.
RO Apr 11, 2022
Deep Reinforcement Learning Based Semi-Autonomous Control for Robotic Surgery
Ruiqi Zhu, Dandan Zhang, Benny Lo
In recent decades, surgical robots have brought tremendous benefits to surgeons and patients. With dexterous operation and high precision, surgical robots can offer patients shorter recovery times and shorter hospital stays. However, in practice, current surgical robots are controlled entirely by surgeons via teleoperation. Surgery involves many repetitive but simple manipulations, which can cause unnecessary fatigue for surgeons. In this paper, we propose a deep reinforcement learning-based semi-autonomous control framework for robotic surgery. A user study showed that the framework can reduce completion time by 19.1% and travel length by 58.7%.
SY Apr 12
i-Tac: Inverse Design of 3D-Printed Tactile Elastomers with Scalable and Tunable Optical and Mechanical Properties
Wen Fan, Dandan Zhang
Elastomers are central to vision-based tactile sensors (VBTSs), where they transduce external contact into observable deformation. Different VBTS architectures, however, require distinct optical and mechanical properties, particularly transparency and hardness. Conventional elastomer design relies on a forward, trial-and-error optimisation process from material preparation to property evaluation, which is inefficient and offers limited property scalability and target tunability. In this work, we present i-Tac, an inverse design pipeline for tailoring 3D-printed tactile elastomers with target optical and mechanical properties. Inspired by the composite structure of the human dermis, i-Tac exploits multi-material PolyJet additive manufacturing with three complementary resins. A mixture design methodology is employed to characterise the printed elastomers and establish response surface models (ReSMs) that map material compositions to functional properties, thereby defining a scalable property space. Based on user-defined targets, a desirability-function-based multi-objective optimisation is then performed to identify feasible composition regions and derive an optimal operating window for fabrication. This enables elastomers with desired properties to be manufactured in a single iteration, thereby achieving efficient target tunability. Experimental results validate the proposed i-Tac framework in terms of both property scalability and inverse design performance, showing that i-Tac can effectively tailor elastomer transparency and hardness while reducing the iterative burden of conventional forward design. By fabricating physical sensor samples from both commercial and custom designs, the proposed framework further demonstrates the potential of inverse-designed, monolithically manufactured elastomers for customisable VBTS fabrication.
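The desirability-function step can be illustrated with a minimal sketch. The abstract does not state which desirability form or response surface models i-Tac uses, so everything below is an assumption: a Derringer-Suich style "target is best" desirability, a geometric-mean overall score, two made-up linear response surfaces (`transparency_model`, `hardness_model`), and a coarse grid search over three-resin mixtures whose fractions sum to 1.

```python
import math

def desirability_target(y, low, target, high, weight=1.0):
    """'Target is best' desirability in [0, 1] (Derringer-Suich style, assumed form)."""
    if y <= low or y >= high:
        return 0.0
    if y <= target:
        return ((y - low) / (target - low)) ** weight
    return ((high - y) / (high - target)) ** weight

def overall_desirability(ds):
    """Geometric mean of individual desirabilities; zero if any objective fails."""
    if any(d == 0.0 for d in ds):
        return 0.0
    return math.exp(sum(math.log(d) for d in ds) / len(ds))

# Hypothetical response surfaces mapping resin fractions x = (x1, x2, x3) to properties.
# The coefficients are invented for illustration only.
def transparency_model(x):
    return 90 * x[0] + 40 * x[1] + 60 * x[2]

def hardness_model(x):
    return 30 * x[0] + 80 * x[1] + 50 * x[2]

def best_composition(steps=20):
    """Grid search over the mixture simplex for the most desirable composition."""
    best_score, best_x = 0.0, None
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            x = (i / steps, j / steps, (steps - i - j) / steps)
            score = overall_desirability([
                desirability_target(transparency_model(x), 50, 70, 90),
                desirability_target(hardness_model(x), 40, 55, 70),
            ])
            if score > best_score:
                best_score, best_x = score, x
    return best_score, best_x

score, composition = best_composition()
print(score, composition)
```

The geometric mean is a common choice here because a composition that completely misses any single target scores zero overall, which matches the idea of a feasible composition region.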
CL Feb 5
Stop the Flip-Flop: Context-Preserving Verification for Fast Revocable Diffusion Decoding
Yanzheng Xiang, Lan Wei, Yizhen Yao et al.
Parallel diffusion decoding can accelerate diffusion language model inference by unmasking multiple tokens per step, but aggressive parallelism often harms quality. Revocable decoding mitigates this by rechecking earlier tokens, yet we observe that existing verification schemes frequently trigger flip-flop oscillations, where tokens are remasked and later restored unchanged. This behaviour slows inference in two ways: remasking verified positions weakens the conditioning context for parallel drafting, and repeated remask cycles consume the revision budget with little net progress. We propose COVER (Cache Override Verification for Efficient Revision), which performs leave-one-out verification and stable drafting within a single forward pass. COVER constructs two attention views via KV cache override: selected seeds are masked for verification, while their cached key value states are injected for all other queries to preserve contextual information, with a closed form diagonal correction preventing self leakage at the seed positions. COVER further prioritises seeds using a stability aware score that balances uncertainty, downstream influence, and cache drift, and it adapts the number of verified seeds per step. Across benchmarks, COVER markedly reduces unnecessary revisions and yields faster decoding while preserving output quality.
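The flip-flop behaviour described above (a token is remasked and later restored unchanged) can be made concrete with a small counting sketch. This is not COVER itself; the history format, the `MASK` sentinel, and the function name `count_flip_flops` are assumptions made for illustration.

```python
MASK = None  # assumed sentinel for a masked position

def count_flip_flops(history):
    """Count remask-then-restore-unchanged events.

    history: list of token sequences, one per decoding step, all the same length.
    """
    flips = 0
    for pos in range(len(history[0])):
        last_token = MASK      # last unmasked token seen at this position
        was_remasked = False   # True while the position is masked after being unmasked
        for step in history:
            tok = step[pos]
            if tok is MASK:
                if last_token is not MASK:
                    was_remasked = True
            else:
                if was_remasked and tok == last_token:
                    flips += 1  # restored to the same token: wasted revision
                last_token = tok
                was_remasked = False
    return flips

history = [
    [MASK, MASK],
    ["a", MASK],
    [MASK, "b"],   # position 0 is remasked
    ["a", "b"],    # and restored unchanged
]
print(count_flip_flops(history))
```

A counter like this could serve as a diagnostic for how much of a revision budget goes to cycles with no net change.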
CV Jul 31, 2023
Hierarchical Semi-Supervised Learning Framework for Surgical Gesture Segmentation and Recognition Based on Multi-Modality Data
Zhili Yuan, Jialin Lin, Dandan Zhang
Segmenting and recognizing surgical operation trajectories into distinct, meaningful gestures is a critical preliminary step in surgical workflow analysis for robot-assisted surgery. This step is necessary for facilitating learning from demonstrations for autonomous robotic surgery, evaluating surgical skills, and so on. In this work, we develop a hierarchical semi-supervised learning framework for surgical gesture segmentation using multi-modality data (i.e., kinematics and vision data). More specifically, surgical tasks are initially segmented based on distance characteristics-based profiles and variance characteristics-based profiles constructed using kinematics data. Subsequently, a Transformer-based network with a pre-trained ResNet-18 backbone is used to extract visual features from the surgical operation videos. By combining the potential segmentation points obtained from both modalities, we can determine the final segmentation points. Furthermore, gesture recognition can be implemented based on supervised learning. The proposed approach has been evaluated using data from the publicly available JIGSAWS database, including Suturing, Needle Passing, and Knot Tying tasks. The results reveal an average F1 score of 0.623 for segmentation and an accuracy of 0.856 for recognition.
LG Aug 2, 2024
TCR-GPT: Integrating Autoregressive Model and Reinforcement Learning for T-Cell Receptor Repertoires Generation
Yicheng Lin, Dandan Zhang, Yun Liu
T-cell receptors (TCRs) play a crucial role in the immune system by recognizing and binding to specific antigens presented by infected or cancerous cells. Understanding the sequence patterns of TCRs is essential for developing targeted immune therapies and designing effective vaccines. Language models, such as auto-regressive transformers, offer a powerful solution to this problem by learning the probability distributions of TCR repertoires, enabling the generation of new TCR sequences that inherit the underlying patterns of the repertoire. We introduce TCR-GPT, a probabilistic model built on a decoder-only transformer architecture, designed to uncover and replicate sequence patterns in TCR repertoires. TCR-GPT achieves a Pearson correlation coefficient of 0.953 in inferring sequence probability distributions. Furthermore, by leveraging reinforcement learning (RL), we adapted the distribution of TCR sequences to generate TCRs capable of recognizing specific peptides, offering significant potential for advancing targeted immune therapies and vaccine development. After RL fine-tuning, pretrained TCR-GPT models demonstrated the ability to produce TCR repertoires likely to bind specific peptides, illustrating RL's efficiency in enhancing the model's adaptability to the probability distributions of biologically relevant TCR sequences.
SY Mar 15
Context-Aware Adaptive Shared Control for Magnetically-Driven Bimanual Dexterous Micromanipulation
Yongchen Wang, Kangyi Lu, Lan Wei et al.
Magnetically actuated robots provide a promising untethered platform for navigation in confined environments, enabling biological studies and targeted micro-delivery. However, dexterous manipulation in complex structures remains challenging. While single-arm magnetic actuation suffices for simple transport, steering through tortuous or bifurcating channels demands coordinated control of multiple magnetic sources to generate the torques required for precise rotation and directional guidance. Bimanual teleoperation enables such dexterous steering but imposes high cognitive demands, as operators must handle the nonlinear dynamics of magnetic actuation while coordinating two robotic manipulators. To address these limitations, we propose Bi-CAST, a context-aware adaptive shared control framework for bimanual magnetic micromanipulation. A multimodal network fuses spatio-temporal visual features, spatial risk metrics, and historical states to continuously adjust the control authority of each manipulator in real time. In parallel, a bidirectional haptic interface integrates force-based intent recognition with risk-aware guidance, enabling force feedback to provide a continuous channel for dynamic human-machine authority negotiation. We validate the framework through user studies with eight participants performing three navigation tasks of increasing complexity in a vascular phantom. Compared with fixed authority and discrete switching baselines, Bi-CAST achieves up to 76.6% reduction in collisions, 25.9% improvement in trajectory smoothness, and 44.4% lower NASA-TLX workload, while delivering the fastest task completion times.
RO Apr 13
Dual-Control Frequency-Aware Diffusion Model for Depth-Dependent Optical Microrobot Microscopy Image Generation
Lan Wei, Zongcai Tan, Kangyi Lu et al.
Optical microrobots actuated by optical tweezers (OT) are important for cell manipulation and microscale assembly, but their autonomous operation depends on accurate 3D perception. Developing such perception systems is challenging because large-scale, high-quality microscopy datasets are scarce, owing to complex fabrication processes and labor-intensive annotation. Although generative AI offers a promising route for data augmentation, existing generative adversarial network (GAN)-based methods struggle to reproduce key optical characteristics, particularly depth-dependent diffraction and defocus effects. To address this limitation, we propose Du-FreqNet, a dual-control, frequency-aware diffusion model for physically consistent microscopy image synthesis. The framework features two independent ControlNet branches to encode microrobot 3D point clouds and depth-specific mesh layers, respectively. We introduce an adaptive frequency-domain loss that dynamically reweights high- and low-frequency components based on the distance to the focal plane. By leveraging differentiable FFT-based supervision, Du-FreqNet captures physically meaningful frequency distributions often missed by pixel-space methods. Trained on a limited dataset (e.g., 80 images per pose), our model achieves controllable, depth-dependent image synthesis, improving SSIM by 20.7% over baselines. Extensive experiments demonstrate that Du-FreqNet generalizes effectively to unseen poses and significantly enhances downstream tasks, including 3D pose and depth estimation, thereby facilitating robust closed-loop control in microrobotic systems.
RO Apr 13
Micro-Dexterity in Biological Micromanipulation: Embodiment, Perception, and Control
Kangyi Lu, Lan Wei, Zongcai Tan et al.
Microscale manipulation has advanced substantially in controlled locomotion and targeted transport, yet many biomedical applications require precise and adaptive interaction with biological micro-objects. At these scales, manipulation is realized through three main classes of platforms: embodied microrobots that physically interact as mobile agents, field-mediated systems that generate contactless trapping or manipulation forces, and externally actuated end-effectors that interact through remotely driven physical tools. Unlike macroscale manipulators, these systems function in fluidic, confined, and surface-dominated environments characterized by negligible inertia, dominant interfacial forces, and soft, heterogeneous, and fragile targets. Consequently, classical assumptions of dexterous manipulation, including rigid-body contact, stable grasping, and rich proprioceptive feedback, become difficult to maintain. This review introduces micro-dexterity as a framework for analyzing biological micromanipulation through the coupled roles of embodiment, perception, and control. We examine how classical manipulation primitives, including pushing, reorientation, grasping, and cooperative manipulation, are reformulated at the microscale; compare the architectures that enable them, from contact-based micromanipulators to contactless field-mediated systems and cooperative multi-agent platforms; and review the perception and control strategies required for task execution. We identify the current dexterity gap between laboratory demonstrations and clinically relevant biological manipulation, and outline key challenges for future translation.
CV Feb 22
MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose
Sirine Bhouri, Lan Wei, Jian-Qing Zheng et al.
Acquiring aligned visuo-tactile datasets is slow and costly, requiring specialised hardware and large-scale data collection. Synthetic generation is promising, but prior methods are typically single-modality, limiting cross-modal learning. We present MultiDiffSense, a unified diffusion model that synthesises images for multiple vision-based tactile sensors (ViTac, TacTip, ViTacTip) within a single architecture. Our approach uses dual conditioning on CAD-derived, pose-aligned depth maps and structured prompts that encode sensor type and 4-DoF contact pose, enabling controllable, physically consistent multi-modal synthesis. Evaluating on 8 objects (5 seen, 3 novel) and unseen poses, MultiDiffSense outperforms a Pix2Pix cGAN baseline in SSIM by +36.3% (ViTac), +134.6% (ViTacTip), and +64.7% (TacTip). For downstream 3-DoF pose estimation, mixing 50% synthetic with 50% real halves the required real data while maintaining competitive performance. MultiDiffSense alleviates the data-collection bottleneck in tactile sensing and enables scalable, controllable multi-modal dataset generation for robotic applications.
SY Mar 15
Surgi-HDTMR: Closing the Sensorimotor Loop in Bimanual Microsurgery via Haptics, Digital Twin, and Mixed Reality
Songming Ping, Shaoyue Wen, Junhong Chen et al.
Robotic microsurgery demands precise bimanual control, intuitive interaction, and informative force feedback. However, most training platforms for robotic microsurgery lack immersive 3D interaction and high-fidelity haptics. Here, we present Surgi-HDTMR, a mixed-reality (MR) and digital-twin (DT) training system that couples bimanual haptic teleoperation with a benchtop microsurgical robotic platform and 3D-printed phantoms. A metrically co-registered, time-synchronized DT aligns in-situ MR guidance with the physical workspace and drives a depth-adaptive haptic model that renders contact, puncture, and tissue-retraction forces. In a within-subjects study of simulated cortical navigation and tumor resection, Surgi-HDTMR shortened task time, reduced harmful contacts and collisions, and improved perceptual accuracy relative to non-haptic and non-adaptive baselines. These results suggest that tightly coupling MR overlays with a synchronized DT, together with depth-adaptive haptics, can accelerate skill acquisition and improve safety in robot-assisted microsurgery, pointing toward next-generation surgical training.
SY Mar 15
DexterousMag: A Reconfigurable Electromagnetic Actuation System for Miniature Helical Robot
Jialin Lin, Dandan Zhang
Despite the promise of magnetically actuated miniature helical robots for minimally invasive interventions, state-of-the-art electromagnetic actuation systems are often space-inefficient and geometrically fixed. These constraints hinder clinical translation and, moreover, prevent task-adaptive trade-offs among workspace coverage, energy distribution, and field/gradient capability. We present DexterousMag, a robot-arm-assisted three-coil electromagnetic actuation system that enables continuous geometric reconfiguration of a compact coil group, thereby redistributing magnetic-field and gradient capability for task-adaptive operation. The reconfiguration is realized by a parallel mechanism that exposes a single geometric DOF of the coil group, conveniently parameterized by the polar angle. Using an FEM-based modeling pipeline, we precompute actuation and gradient libraries and quantify the resulting trade-offs under current limits: configurations that favor depth reach expand the feasible region but reduce peak field/gradient, whereas configurations that favor near-surface capability concentrate stronger fields/gradients and support lifting. We validate these trade-offs on representative tasks (deep translation, planar tracking, and 3D lifting) and further demonstrate a proof-of-concept online geometry scheduling scheme for combined tasks, benchmarked against fixed-geometry settings. Overall, DexterousMag establishes continuous geometric reconfiguration as an operational mechanism for enlarging the practical envelope of miniature helical robot actuation while improving energy efficiency and safety.
CV Dec 1, 2025
SARL: Spatially-Aware Self-Supervised Representation Learning for Visuo-Tactile Perception
Gurmeher Khurana, Lan Wei, Dandan Zhang
Contact-rich robotic manipulation requires representations that encode local geometry. Vision provides global context but lacks direct measurements of properties such as texture and hardness, whereas touch supplies these cues. Modern visuo-tactile sensors capture both modalities in a single fused image, yielding intrinsically aligned inputs that are well suited to manipulation tasks requiring visual and tactile information. Most self-supervised learning (SSL) frameworks, however, compress feature maps into a global vector, discarding spatial structure and misaligning with the needs of manipulation. To address this, we propose SARL, a spatially-aware SSL framework that augments the Bootstrap Your Own Latent (BYOL) architecture with three map-level objectives, including Saliency Alignment (SAL), Patch-Prototype Distribution Alignment (PPDA), and Region Affinity Matching (RAM), to keep attentional focus, part composition, and geometric relations consistent across views. These losses act on intermediate feature maps, complementing the global objective. SARL consistently outperforms nine SSL baselines across six downstream tasks with fused visual-tactile data. On the geometry-sensitive edge-pose regression task, SARL achieves a Mean Absolute Error (MAE) of 0.3955, a 30% relative improvement over the next-best SSL method (0.5682 MAE) and approaching the supervised upper bound. These findings indicate that, for fused visual-tactile data, the most effective signal is structured spatial equivariance, in which features vary predictably with object geometry, which enables more capable robotic perception.
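The reported 30% figure can be checked directly from the two MAE values quoted in the abstract. The helper name `relative_improvement` is introduced here for illustration; the arithmetic is simply the MAE reduction divided by the baseline MAE.

```python
def relative_improvement(baseline_mae, new_mae):
    """Fractional reduction in error relative to the baseline (lower MAE is better)."""
    return (baseline_mae - new_mae) / baseline_mae

improvement = relative_improvement(0.5682, 0.3955)
print(f"{improvement:.1%}")  # prints 30.4%
```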
RO Mar 12, 2025
Rethinking Bimanual Robotic Manipulation: Learning with Decoupled Interaction Framework
Jian-Jian Jiang, Xiao-Ming Wu, Yi-Xiang He et al.
Bimanual robotic manipulation is an emerging and critical topic in the robotics community. Previous works primarily rely on integrated control models that take the perceptions and states of both arms as inputs to directly predict their actions. However, we argue that bimanual manipulation involves not only coordinated tasks but also various uncoordinated tasks that do not require explicit cooperation during execution, such as grasping objects with the closest hand, which integrated control frameworks fail to consider because they enforce cooperation in the early inputs. In this paper, we propose a novel decoupled interaction framework that considers the characteristics of different tasks in bimanual manipulation. The key insight of our framework is to assign an independent model to each arm to enhance the learning of uncoordinated tasks, while introducing a selective interaction module that adaptively learns weights from its own arm to improve the learning of coordinated tasks. Extensive experiments on seven tasks in the RoboTwin dataset demonstrate that: (1) Our framework achieves outstanding performance, with a 23.5% boost over the SOTA method. (2) Our framework is flexible and can be seamlessly integrated into existing methods. (3) Our framework can be effectively extended to multi-agent manipulation tasks, achieving a 28% boost over the integrated control SOTA. (4) The performance boost stems from the decoupled design itself, surpassing the SOTA by 16.5% in success rate with only 1/6 of the model size.
RO May 9, 2025
Multi-Agent Systems for Robotic Autonomy with LLMs
Junhong Chen, Ziqi Yang, Haoyuan G Xu et al.
Since the advent of Large Language Models (LLMs), research based on such models has attracted sustained academic attention and had significant impact, especially in AI and robotics. In this paper, we propose a multi-agent framework with LLMs to construct an integrated system for robotic task analysis, mechanical design, and path generation. The framework includes three core agents: Task Analyst, Robot Designer, and Reinforcement Learning Designer. Outputs are formatted as multimodal results, such as code files or technical reports, for better understandability and usability. To evaluate generalizability comparatively, we conducted experiments with both GPT and DeepSeek models. Results demonstrate that the proposed system can design feasible robots with control strategies when appropriate task inputs are provided, exhibiting substantial potential for enhancing the efficiency and accessibility of robotic system development in research and industrial applications.
RO Jul 29, 2025
Research Challenges and Progress in the End-to-End V2X Cooperative Autonomous Driving Competition
Ruiyang Hao, Haibao Yu, Jiaru Zhong et al.
With the rapid advancement of autonomous driving technology, vehicle-to-everything (V2X) communication has emerged as a key enabler for extending perception range and enhancing driving safety by providing visibility beyond the line of sight. However, integrating multi-source sensor data from both ego-vehicles and infrastructure under real-world constraints, such as limited communication bandwidth and dynamic environments, presents significant technical challenges. To facilitate research in this area, we organized the End-to-End Autonomous Driving through V2X Cooperation Challenge, which features two tracks: cooperative temporal perception and cooperative end-to-end planning. Built on the UniV2X framework and the V2X-Seq-SPD dataset, the challenge attracted participation from over 30 teams worldwide and established a unified benchmark for evaluating cooperative driving systems. This paper describes the design and outcomes of the challenge, highlights key research problems including bandwidth-aware fusion, robust multi-agent planning, and heterogeneous sensor integration, and analyzes emerging technical trends among top-performing solutions. By addressing practical constraints in communication and data fusion, the challenge contributes to the development of scalable and reliable V2X-cooperative autonomous driving systems.
LG Nov 13, 2024
Evaluating Molecule Synthesizability via Retrosynthetic Planning and Reaction Prediction
Songtao Liu, Dandan Zhang, Zhengkai Tu et al.
A significant challenge in wet lab experiments with current drug design generative models is the trade-off between pharmacological properties and synthesizability. Molecules predicted to have highly desirable properties are often difficult to synthesize, while those that are easily synthesizable tend to exhibit less favorable properties. As a result, evaluating the synthesizability of molecules in general drug design scenarios remains a significant challenge in the field of drug discovery. The commonly used synthetic accessibility (SA) score aims to evaluate the ease of synthesizing generated molecules, but it falls short of guaranteeing that synthetic routes can actually be found. Inspired by recent advances in top-down synthetic route generation and forward reaction prediction, we propose a new, data-driven metric to evaluate molecule synthesizability. This novel metric leverages the synergistic duality between retrosynthetic planners and reaction predictors, both of which are trained on extensive reaction datasets. To demonstrate the efficacy of our metric, we conduct a comprehensive evaluation of round-trip scores across a range of representative molecule generative models.
AI May 27, 2025
Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models
Zesen Lyu, Dandan Zhang, Wei Ye et al.
Spatial reasoning is a core component of human cognition, enabling individuals to perceive, comprehend, and interact with the physical world. It relies on a nuanced understanding of spatial structures and inter-object relationships, serving as the foundation for complex reasoning and decision-making. To investigate whether current vision-language models (VLMs) exhibit similar capability, we introduce Jigsaw-Puzzles, a novel benchmark consisting of 1,100 carefully curated real-world images with high spatial complexity. Based on this dataset, we design five tasks to rigorously evaluate VLMs' spatial perception, structural understanding, and reasoning capabilities, while deliberately minimizing reliance on domain-specific knowledge to better isolate and assess the general spatial reasoning capability. We conduct a comprehensive evaluation across 24 state-of-the-art VLMs. The results show that even the strongest model, Gemini-2.5-Pro, achieves only 77.14% overall accuracy and performs particularly poorly on the Order Generation task, with only 30.00% accuracy, far below the performance exceeding 90% achieved by human participants. This persistent gap underscores the need for continued progress, positioning Jigsaw-Puzzles as a challenging and diagnostic benchmark for advancing spatial reasoning research in VLMs. Our project page is at https://zesen01.github.io/jigsaw-puzzles.
LG May 20, 2024
Adaptive Convolutional Forecasting Network Based on Time Series Feature-Driven
Dandan Zhang, Zhiqiang Zhang, Nanguang Chen et al.
Time series data in real-world scenarios contain a substantial amount of nonlinear information, which significantly interferes with the training process of models, leading to decreased prediction performance. Therefore, during the time series forecasting process, extracting the local and global time series patterns and understanding the potential nonlinear features among different time observations are highly significant. To address this challenge, we introduce multi-resolution convolution and deformable convolution operations. By enlarging the receptive field using convolution kernels with different dilation factors to capture temporal correlation information at different resolutions, and adaptively adjusting the sampling positions through additional offset vectors, we enhance the network's ability to capture potential nonlinear features among time observations. Building upon this, we propose ACNet, an adaptive convolutional network designed to effectively model the local and global temporal dependencies and the nonlinear features between observations in multivariate time series. Specifically, by extracting and fusing time series features at different resolutions, we capture both local contextual information and global patterns in the time series. The designed nonlinear feature adaptive extraction module captures the nonlinear features among different time observations in the time series. We evaluated the performance of ACNet across twelve real-world datasets. The results indicate that ACNet consistently achieves state-of-the-art performance in both short-term and long-term forecasting tasks with favorable runtime efficiency.
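As a worked illustration of how dilation factors enlarge the receptive field, the sketch below computes the span of a single dilated 1D kernel and the receptive field of a stride-1 stack. The kernel size 3 and dilations 1, 2, 4 are assumed example values, not settings reported for ACNet, and the function names are hypothetical.

```python
from typing import Iterable


def dilated_span(kernel_size: int, dilation: int) -> int:
    """Number of time steps covered by one dilated 1D kernel."""
    return dilation * (kernel_size - 1) + 1


def stacked_receptive_field(kernel_size: int, dilations: Iterable[int]) -> int:
    """Receptive field of stride-1 dilated 1D convolutions applied in sequence."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Example values (assumptions): kernel size 3, dilation factors 1, 2 and 4.
branch_spans = [dilated_span(3, d) for d in (1, 2, 4)]  # [3, 5, 9]
stack_field = stacked_receptive_field(3, (1, 2, 4))     # 15
```

Used as parallel branches, the three kernels see 3, 5, and 9 time steps respectively, which is one way to read "different resolutions"; stacked in sequence, they would cover 15 steps with the same number of weights per layer.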
CV Nov 20, 2025
Physics-Informed Machine Learning for Efficient Sim-to-Real Data Augmentation in Micro-Object Pose Estimation
Zongcai Tan, Lan Wei, Dandan Zhang
Precise pose estimation of optical microrobots is essential for enabling high-precision object tracking and autonomous biological studies. However, current methods rely heavily on large, high-quality microscope image datasets, which are difficult and costly to acquire due to the complexity of microrobot fabrication and the labour-intensive labelling. Digital twin systems offer a promising path for sim-to-real data augmentation, yet existing techniques struggle to replicate complex optical microscopy phenomena, such as diffraction artifacts and depth-dependent imaging. This work proposes a novel physics-informed deep generative learning framework that, for the first time, integrates wave optics-based physical rendering and depth alignment into a generative adversarial network (GAN), to synthesise high-fidelity microscope images for microrobot pose estimation efficiently. Our method improves the structural similarity index (SSIM) by 35.6% compared to purely AI-driven methods, while maintaining real-time rendering speeds (0.022 s/frame). The pose estimator (CNN backbone) trained on our synthetic data achieves 93.9%/91.9% (pitch/roll) accuracy, just 5.0%/5.4% (pitch/roll) below that of an estimator trained exclusively on real data. Furthermore, our framework generalises to unseen poses, enabling data augmentation and robust pose estimation for novel microrobot configurations without additional training data.
OPTICS Oct 15, 2025
Optical Computation-in-Communication enables low-latency, high-fidelity perception in telesurgery
Rui Yang, Jiaming Hu, Jian-Qing Zheng et al.
Artificial intelligence (AI) holds significant promise for enhancing intraoperative perception and decision-making in telesurgery, where physical separation impairs sensory feedback and control. Despite advances in medical AI and surgical robotics, conventional electronic AI architectures remain fundamentally constrained by the compounded latency from serial processing of inference and communication. This limitation is especially critical in latency-sensitive procedures such as endovascular interventions, where delays over 200 ms can compromise real-time AI reliability and patient safety. Here, we introduce an Optical Computation-in-Communication (OCiC) framework that reduces end-to-end latency significantly by performing AI inference concurrently with optical communication. OCiC integrates Optical Remote Computing Units (ORCUs) directly into the optical communication pathway, with each ORCU experimentally achieving up to 69 tera-operations per second per channel through spectrally efficient two-dimensional photonic convolution. The system maintains ultrahigh inference fidelity within 0.1% of CPU/GPU baselines on classification and coronary angiography segmentation, while intrinsically mitigating cumulative error propagation, a longstanding barrier to deep optical network scalability. We validated the robustness of OCiC through outdoor dark fibre deployments, confirming consistent and stable performance across varying environmental conditions. When scaled globally, OCiC transforms long-haul fibre infrastructure into a distributed photonic AI fabric with exascale potential, enabling reliable, low-latency telesurgery across distances up to 10,000 km and opening a new optical frontier for distributed medical intelligence.
RO May 27, 2025
Interactive OT Gym: A Reinforcement Learning-Based Interactive Optical tweezer (OT)-Driven Microrobotics Simulation Platform
Zongcai Tan, Dandan Zhang
Optical tweezers (OT) offer unparalleled capabilities for micromanipulation with submicron precision in biomedical applications. However, controlling conventional multi-trap OT to achieve cooperative manipulation of multiple complex-shaped microrobots in dynamic environments poses a significant challenge. To address this, we introduce Interactive OT Gym, a reinforcement learning (RL)-based simulation platform designed for OT-driven microrobotics. Our platform supports complex physical field simulations and integrates haptic feedback interfaces, RL modules, and context-aware shared control strategies tailored for OT-driven microrobots in cooperative biological object manipulation tasks. This integration allows for an adaptive blend of manual and autonomous control, enabling seamless transitions between human input and autonomous operation. We evaluated the effectiveness of our platform using a cell manipulation task. Experimental results show that our shared control system significantly improves micromanipulation performance, reducing task completion time by approximately 67% compared to using pure human or RL control alone and achieving a 100% success rate. With its high fidelity, interactivity, low cost, and high-speed simulation capabilities, Interactive OT Gym serves as a user-friendly training and testing environment for the development of advanced interactive OT-driven micromanipulation systems and control algorithms. For more details on the project, please see our website: https://sites.google.com/view/otgym
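One common way to realise a blend of manual and autonomous control is a convex combination of the two commands. The sketch below assumes that form purely for illustration; the abstract does not specify the platform's blending rule, and `blend_commands` and the `authority` weight are hypothetical names.

```python
from typing import Sequence, Tuple


def blend_commands(
    human: Sequence[float],
    agent: Sequence[float],
    authority: float,
) -> Tuple[float, ...]:
    """Convex blend of two velocity commands.

    authority = 1.0 gives pure human control, 0.0 gives pure autonomous control.
    A linear blend is an assumption here, not the platform's documented rule.
    """
    if not 0.0 <= authority <= 1.0:
        raise ValueError("authority must lie in [0, 1]")
    if len(human) != len(agent):
        raise ValueError("commands must have the same dimension")
    return tuple(authority * h + (1.0 - authority) * a for h, a in zip(human, agent))


# Hypothetical 2D trap velocity commands from the operator and the RL policy.
blended = blend_commands((1.0, 0.0), (0.0, 1.0), authority=0.25)
```

A context-aware scheme would vary `authority` over time, for example lowering it when the autonomous policy is reliable, which is what makes transitions between human and autonomous operation gradual rather than abrupt.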
IV Mar 31, 2025
Coarse-to-Fine Learning for Multi-Pipette Localisation in Robot-Assisted In Vivo Patch-Clamp
Lan Wei, Gema Vera Gonzalez, Phatsimo Kgwarae et al.
In vivo image-guided multi-pipette patch-clamp is essential for studying cellular interactions and network dynamics in neuroscience. However, current procedures mainly rely on manual expertise, which limits accessibility and scalability. Robotic automation presents a promising solution, but achieving precise real-time detection of multiple pipettes remains a challenge. Existing methods focus on ex vivo experiments or single pipette use, making them inadequate for in vivo multi-pipette scenarios. To address these challenges, we propose a heatmap-augmented coarse-to-fine learning technique to facilitate multi-pipette real-time localisation for robot-assisted in vivo patch-clamp. More specifically, we introduce a Generative Adversarial Network (GAN)-based module to remove background noise and enhance pipette visibility. We then introduce a two-stage Transformer model that starts with predicting the coarse heatmap of the pipette tips, followed by a fine-grained coordinate regression module for precise tip localisation. To ensure robust training, we use the Hungarian algorithm for optimal matching between the predicted and actual locations of tips. Experimental results demonstrate that our method achieved > 98% accuracy within 10 μm, and > 89% accuracy within 5 μm for the localisation of multi-pipette tips. The average MSE is 2.52 μm.
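The matching step between predicted and actual tip locations can be illustrated with a small sketch. To stay dependency-free it searches all permutations instead of running the Hungarian algorithm; for a handful of pipettes both yield the same minimum-cost assignment, but the brute-force search does not scale. The function name, the Euclidean cost, and the coordinates are assumptions for illustration.

```python
from itertools import permutations
from math import dist, inf
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def match_tips(
    predicted: Sequence[Point], actual: Sequence[Point]
) -> Tuple[List[Tuple[int, int]], float]:
    """Return the minimum-total-distance one-to-one matching.

    Brute force over permutations (not the Hungarian algorithm itself);
    the optimal assignment is the same, only the running time differs.
    """
    if len(predicted) != len(actual):
        raise ValueError("expected the same number of predicted and actual tips")
    best_cost, best_perm = inf, ()
    for perm in permutations(range(len(actual))):
        cost = sum(dist(predicted[i], actual[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(enumerate(best_perm)), best_cost


# Hypothetical tip coordinates; the ground-truth list is in a different order.
predicted_tips = [(10.0, 10.0), (50.0, 52.0)]
actual_tips = [(51.0, 50.0), (11.0, 10.0)]
pairs, total_cost = match_tips(predicted_tips, actual_tips)
```

Here `pairs` maps each predicted index to the index of the ground-truth tip it is compared against, so a training loss is computed per matched pair regardless of the order in which the model emits its predictions.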
RO May 16, 2021
Explainable Hierarchical Imitation Learning for Robotic Drink Pouring
Dandan Zhang, Yu Zheng, Qiang Li et al.
To accurately pour drinks into various containers is an essential skill for service robots. However, drink pouring is a dynamic process and difficult to model. Traditional deep imitation learning techniques for implementing autonomous robotic pouring have an inherent black-box effect and require a large amount of demonstration data for model training. To address these issues, an Explainable Hierarchical Imitation Learning (EHIL) method is proposed in this paper such that a robot can learn high-level general knowledge and execute low-level actions across multiple drink pouring scenarios. Moreover, with EHIL, a logical graph can be constructed for task execution, through which the decision-making process for action generation can be made explainable to users and the causes of failure can be traced out. Based on the logical graph, the framework is manipulable to achieve different targets while the adaptability to unseen scenarios can be achieved in an explainable manner. A series of experiments have been conducted to verify the effectiveness of the proposed method. Results indicate that EHIL outperforms the traditional behavior cloning method in terms of success rate, adaptability, manipulability and explainability.
RO May 15, 2021
An Ergonomic Interaction Workspace Analysis Method for the Optimal Design of a Surgical Master Manipulator
Dandan Zhang, Jindong Liu, Guangzhong Yang
The master control console is where robots collaborate with humans in a shared environment, so ergonomics is an important design consideration. With an ergonomic design, surgeons can conduct surgical tasks more comfortably and efficiently, and the quality of teleoperated robotic surgery can be improved. In this paper, an Ergonomic Interaction Workspace Analysis method is proposed to optimize master manipulators and to fulfil ergonomic considerations in the design of a master manipulator for teleoperated robotic surgery.
CV May 2, 2021
Surgical Gesture Recognition Based on Bidirectional Multi-Layer Independently RNN with Explainable Spatial Feature Extraction
Dandan Zhang, Ruoxi Wang, Benny Lo
Minimally invasive surgery mainly consists of a series of sub-tasks, which can be decomposed into basic gestures or contexts. As a prerequisite of autonomous operation, surgical gesture recognition can assist motion planning and decision-making, and build up context-aware knowledge to improve the surgical robot control quality. In this work, we aim to develop an effective surgical gesture recognition approach with an explainable feature extraction process. A Bidirectional Multi-Layer independently RNN (BML-indRNN) model is proposed in this paper, while spatial feature extraction is implemented via fine-tuning of a Deep Convolutional Neural Network (DCNN) model constructed based on the VGG architecture. To eliminate the black-box effects of DCNN, Gradient-weighted Class Activation Mapping (Grad-CAM) is employed. It can provide explainable results by showing the regions of the surgical images that have a strong relationship with the surgical gesture classification results. The proposed method was evaluated based on the suturing task with data obtained from the publicly available JIGSAWS database. Comparative studies were conducted to verify the proposed framework. Results indicated that the testing accuracy for the suturing task based on our proposed method is 87.13%, which outperforms most of the state-of-the-art algorithms.
IV Nov 8, 2020
Real-time Surgical Environment Enhancement for Robot-Assisted Minimally Invasive Surgery Based on Super-Resolution
Ruoxi Wang, Dandan Zhang, Qingbiao Li et al.
In Robot-Assisted Minimally Invasive Surgery (RAMIS), a camera assistant is normally required to control the position and zooming ratio of the laparoscope, following the surgeon's instructions. However, moving the laparoscope frequently may lead to unstable and suboptimal views, while adjusting the zooming ratio may interrupt the workflow of the surgical operation. To this end, we propose a multi-scale Generative Adversarial Network (GAN)-based video super-resolution method to construct a framework for automatic zooming ratio adjustment. It can provide automatic real-time zooming for high-quality visualization of the Region Of Interest (ROI) during the surgical operation. In the pipeline of the framework, the Kernel Correlation Filter (KCF) tracker is used for tracking the tips of the surgical tools, while the Semi-Global Block Matching (SGBM)-based depth estimation and Recurrent Neural Network (RNN)-based context-awareness are developed to determine the upscaling ratio for zooming. The framework is validated with the JIGSAWS dataset and Hamlyn Centre Laparoscopic/Endoscopic Video Datasets, with results demonstrating its practicability.
RO Jul 2, 2020
A Learning-Driven Framework with Spatial Optimization For Surgical Suture Thread Reconstruction and Autonomous Grasping Under Multiple Topologies and Environmental Noises
Bo Lu, Wei Chen, Yue-Ming Jin et al.
Surgical knot tying is one of the most fundamental and important procedures in surgery, and a high-quality knot can significantly benefit the postoperative recovery of the patient. However, a lengthy operation can easily fatigue surgeons, especially during the tedious wound closure task. In this paper, we present a vision-based method to automate suture thread grasping, which is a sub-task in surgical knot tying and an intermediate step between the stitching and looping manipulations. To achieve this goal, the acquisition of a suture's three-dimensional (3D) information is critical. Towards this objective, we first adopt a transfer-learning strategy to fine-tune a pre-trained model by learning the information from large legacy surgical data and images obtained by the on-site equipment. Thus, robust suture segmentation can be achieved regardless of inherent environmental noise. We further leverage a searching strategy with termination policies for a suture's sequence inference based on the analysis of multiple topologies. Exact results of the pixel-level sequence along a suture can be obtained, and they can be further applied for 3D shape reconstruction using our optimized shortest path approach. The grasping point considering the suturing criterion can ultimately be acquired. The suture 2D segmentation and ordering sequence inference were extensively evaluated in experiments under environmental noise. Results related to the automated grasping operation were demonstrated by simulations in V-REP and by robot experiments using a Universal Robot (UR) together with the da Vinci Research Kit (dVRK) adopting our learning-driven framework.