ROOct 13, 2022Code
ROS-PyBullet Interface: A Framework for Reliable Contact Simulation and Human-Robot InteractionChristopher E. Mower, Theodoros Stouraitis, João Moura et al.
Reliable contact simulation plays a key role in the development of (semi-)autonomous robots, especially when dealing with contact-rich manipulation scenarios, an active robotics research topic. Besides simulation, components such as sensing, perception, data collection, robot hardware control, human interfaces, etc. are all key enablers towards applying machine learning algorithms or model-based approaches in real world systems. However, there is a lack of software connecting reliable contact simulation with the larger robotics ecosystem (i.e. ROS, Orocos), for a more seamless application of novel approaches, found in the literature, to existing robotic hardware. In this paper, we present the ROS-PyBullet Interface, a framework that provides a bridge between the reliable contact/impact simulator PyBullet and the Robot Operating System (ROS). Furthermore, we provide additional utilities for facilitating Human-Robot Interaction (HRI) in the simulated environment. We also present several use-cases that highlight the capabilities and usefulness of our framework. Please check our video, source code, and examples included in the supplementary material. Our full code base is open source and can be found at https://github.com/cmower/ros_pybullet_interface.
CVMar 25, 2022Code
Multi-scale and Cross-scale Contrastive Learning for Semantic SegmentationTheodoros Pissas, Claudio S. Ravasio, Lyndon Da Cruz et al.
This work considers supervised contrastive learning for semantic segmentation. We apply contrastive learning to enhance the discriminative power of the multi-scale features extracted by semantic segmentation networks. Our key methodological insight is to leverage samples from the feature spaces emanating from multiple stages of a model's encoder itself requiring neither data augmentation nor online memory banks to obtain a diverse set of samples. To allow for such an extension we introduce an efficient and effective sampling process, that enables applying contrastive losses over the encoder's features at multiple scales. Furthermore, by first mapping the encoder's multi-scale representations to a common feature space, we instantiate a novel form of supervised local-global constraint by introducing cross-scale contrastive learning linking high-resolution local features to low-resolution global features. Combined, our multi-scale and cross-scale contrastive losses boost performance of various models (DeepLabV3, HRNet, OCRNet, UPerNet) with both CNN and Transformer backbones, when evaluated on 4 diverse datasets from natural (Cityscapes, PascalContext, ADE20K) but also surgical (CaDIS) domains. Our code is available at https://github.com/RViMLab/MS_CS_ContrSeg. datasets from natural (Cityscapes, PascalContext, ADE20K) but also surgical (CaDIS) domains.
43.4ROMay 23Code
Geometric Workspace Analysis and Transmission-Aware Dynamics of a Serial Spherical Tool for MicrosurgeryAnestis Mablekos-Alexiou, Lyndon da Cruz, Christos Bergeles
We present a kinematic and transmission-aware design framework for a serial spherical mechanism with an additional translational degree of freedom for microsurgery. The first contribution is an analytical workspace formulation that provides geometric insight into reachable motion and enables rapid selection of rotation axis orientations without numerical optimization. The second contribution is a dynamics-informed methodology for mechanisms driven by self-locking transmissions, supporting evaluation of torque requirements for a prescribed workspace geometry. The framework is accompanied by an open-source software package for friction identification and inverse dynamics analysis. Experiments on a purpose-built robotic tool for vitreoretinal surgery validate the predictive capability of the models and demonstrate their practical utility for engineering design.
49.9ROMay 21
Remote Teleoperation of Endovascular Intervention Robots: A Systematic ReviewXingyu Chen, Yinchao Yang, Nikola Fischer et al.
Remote robotic-assisted endovascular intervention offers a promising approach to reduce clinician radiation exposure and physical strain, while extending specialized vascular care to geographically distant regions. Despite advancements, teleoperated endovascular intervention remains underexplored, especially for time-sensitive interventions like mechanical thrombectomy for acute stroke. The aim of the current review was to determine the evidence regarding teleoperated endovascular robotic systems, covering technical feasibility, communication infrastructure, and clinical outcomes. The review further identified research gaps and future directions. Following PRISMA guidelines, 16 studies were included that met the inclusion criteria out of 2501 initial search results. We found that teleoperated catheters and guidewires, driven by mechanical or electromagnetic systems, can be navigated across distances up to 7000 km. With robust communication infrastructure, network latency remained within clinically acceptable limits (30-163 ms). Although initial outcomes highlighted 100% procedural success in small-scale human trials, most evidence stemmed from animal or phantom models. Overall, the findings suggest that teleoperated endovascular intervention can reduce occupational hazards, expand patient access to urgent procedures, and optimize resource allocation. Future research should be conducted in low and middle income countries to demonstrate broader geographical access. Ultimately, multi-center clinical trials are required to validate the safety, efficacy, and generalization in diverse clinical settings.
CVOct 11, 2022
Oflib: Facilitating Operations with and on Optical Flow Fields in PythonClaudio Ravasio, Lyndon Da Cruz, Christos Bergeles
We present a robust theoretical framework for the characterisation and manipulation of optical flow, i.e 2D vector fields, in the context of their use in motion estimation algorithms and beyond. The definition of two frames of reference guides the mathematical derivation of flow field application, inversion, evaluation, and composition operations. This structured approach is then used as the foundation for an implementation in Python 3, with the fully differentiable PyTorch version oflibpytorch supporting back-propagation as required for deep learning. We verify the flow composition method empirically and provide a working example for its application to optical flow ground truth in synthetic training data creation. All code is publicly available.
49.4ROApr 22
Toward Safe Autonomous Robotic Endovascular Interventions using World ModelsHarry Robertshaw, Nikola Fischer, Han-Ru Wu et al.
Autonomous mechanical thrombectomy (MT) presents substantial challenges due to highly variable vascular geometries and the requirements for accurate, real-time control. While reinforcement learning (RL) has emerged as a promising paradigm for the automation of endovascular navigation, existing approaches often show limited robustness when faced with diverse patient anatomies or extended navigation horizons. In this work, we investigate a world-model-based framework for autonomous endovascular navigation built on TD-MPC2, a model-based RL method that integrates planning and learned dynamics. We evaluate a TD-MPC2 agent trained on multiple navigation tasks across hold out patient-specific vasculatures and benchmark its performance against the state-of-the-art Soft Actor-Critic (SAC) algorithm agent. Both approaches are further validated in vitro using patient-specific vascular phantoms under fluoroscopic guidance. In simulation, TD-MPC2 demonstrates a significantly higher mean success rate than SAC (58% vs. 36%, p < 0.001), and mean tip contact forces of 0.15 N, well below the proposed 1.5 N vessel rupture threshold. Mean success rates for TD-MPC2 (68%) were comparable to SAC (60%) in vitro, but TD-MPC2 achieved superior path ratios (p = 0.017) at the cost of longer procedure times (p < 0.001). Together, these results provide the first demonstration of autonomous MT navigation validated across both hold out in silico data and fluoroscopy-guided in vitro experiments, highlighting the promise of world models for safe and generalizable AI-assisted endovascular interventions.
ROApr 29, 2025Code
Hydra: Marker-Free RGB-D Hand-Eye CalibrationMartin Huber, Huanyu Tian, Christopher E. Mower et al.
This work presents an RGB-D imaging-based approach to marker-free hand-eye calibration using a novel implementation of the iterative closest point (ICP) algorithm with a robust point-to-plane (PTP) objective formulated on a Lie algebra. Its applicability is demonstrated through comprehensive experiments using three well known serial manipulators and two RGB-D cameras. With only three randomly chosen robot configurations, our approach achieves approximately 90% successful calibrations, demonstrating 2-3x higher convergence rates to the global optimum compared to both marker-based and marker-free baselines. We also report 2 orders of magnitude faster convergence time (0.8 +/- 0.4 s) for 9 robot configurations over other marker-free methods. Our method exhibits significantly improved accuracy (5 mm in task space) over classical approaches (7 mm in task space) whilst being marker-free. The benchmarking dataset and code are open sourced under Apache 2.0 License, and a ROS 2 integration with robot abstraction is provided to facilitate deployment.
CVMar 22, 2024Code
RetiGen: A Framework for Generalized Retinal Diagnosis Using Multi-View Fundus ImagesZe Chen, Gongyu Zhang, Jiayu Huo et al.
This study introduces a novel framework for enhancing domain generalization in medical imaging, specifically focusing on utilizing unlabelled multi-view colour fundus photographs. Unlike traditional approaches that rely on single-view imaging data and face challenges in generalizing across diverse clinical settings, our method leverages the rich information in the unlabelled multi-view imaging data to improve model robustness and accuracy. By incorporating a class balancing method, a test-time adaptation technique and a multi-view optimization strategy, we address the critical issue of domain shift that often hampers the performance of machine learning models in real-world applications. Experiments comparing various state-of-the-art domain generalization and test-time optimization methodologies show that our approach consistently outperforms when combined with existing baseline and state-of-the-art methods. We also show our online method improves all existing techniques. Our framework demonstrates improvements in domain generalization capabilities and offers a practical solution for real-world deployment by facilitating online adaptation to new, unseen datasets. Our code is available at https://github.com/zgy600/RetiGen .
CVAug 13, 2021Code
Effective semantic segmentation in Cataract Surgery: What matters most?Theodoros Pissas, Claudio Ravasio, Lyndon Da Cruz et al.
Our work proposes neural network design choices that set the state-of-the-art on a challenging public benchmark on cataract surgery, CaDIS. Our methodology achieves strong performance across three semantic segmentation tasks with increasingly granular surgical tool class sets by effectively handling class imbalance, an inherent challenge in any surgical video. We consider and evaluate two conceptually simple data oversampling methods as well as different loss functions. We show significant performance gains across network architectures and tasks especially on the rarest tool classes, thereby presenting an approach for achieving high performance when imbalanced granular datasets are considered. Our code and trained models are available at https://github.com/RViMLab/MICCAI2021_Cataract_semantic_segmentation and qualitative results on unseen surgical video can be found at https://youtu.be/twVIPUj1WZM.
CVFeb 27, 2024
ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual TasksYang Liu, Xiaomin Yu, Gongyu Zhang et al.
"A data scientist is tasked with developing a low-cost surgical VQA system for a 2-month workshop. Due to data sensitivity, she collects 50 hours of surgical video from a hospital, requiring two months for privacy approvals. Privacy restrictions prevent uploading data to platforms like ChatGPT, so she assembles one annotator and a medical expert to manually create QA pairs. This process takes three weeks and costs over $10,000. The trained model provides accurate responses within the limited data scope but lacks broader generalizability, completing the project in 3 months." To simplify the challenges presented in the scenario above. In this paper, we replace the image input with text for Vision-language training. Inspired by prior noise injection methods to reduce modality gaps, we introduce Adaptive ranged cosine Similarity injected noise (ArcSin). First, we introduce an innovative adaptive noise scale that effectively generates the textual elements with more variability while preserving the original text feature's integrity. Second, a similarity pool strategy is employed, expanding the domain generalization potential by broadening the overall noise scale. This dual strategy effectively broadens the scope of the original domain while safeguarding content integrity. Our empirical results demonstrate that these models closely rival those trained on images in terms of performance. Specifically, our method exhibits substantial improvements over the previous state-of-the-art, achieving gains of 1.9 and 1.1 CIDEr points in S-Cap and M-Cap, respectively. Additionally, we observe increases of 0.5 percentage points (pp), 1.4 pp, and 1.4 pp in accuracy for VQA, VQA-E, and VE, respectively, pushing the boundaries of what is achievable within the constraints of image-trained model benchmarks.
CVAug 27, 2025
ROBUST-MIPS: A Combined Skeletal Pose and Instance Segmentation Dataset for Laparoscopic Surgical InstrumentsZhe Han, Charlie Budd, Gongyu Zhang et al.
Localisation of surgical tools constitutes a foundational building block for computer-assisted interventional technologies. Works in this field typically focus on training deep learning models to perform segmentation tasks. Performance of learning-based approaches is limited by the availability of diverse annotated data. We argue that skeletal pose annotations are a more efficient annotation approach for surgical tools, striking a balance between richness of semantic information and ease of annotation, thus allowing for accelerated growth of available annotated data. To encourage adoption of this annotation style, we present, ROBUST-MIPS, a combined tool pose and tool instance segmentation dataset derived from the existing ROBUST-MIS dataset. Our enriched dataset facilitates the joint study of these two annotation styles and allow head-to-head comparison on various downstream tasks. To demonstrate the adequacy of pose annotations for surgical tool localisation, we set up a simple benchmark using popular pose estimation methods and observe high-quality results. To ease adoption, together with the dataset, we release our benchmark models and custom tool pose annotation software.
ROOct 27, 2025
Localising under the drape: proprioception in the era of distributed surgical robotic systemMartin Huber, Nicola A. Cavalcanti, Ayoob Davoodi et al.
Despite their mechanical sophistication, surgical robots remain blind to their surroundings. This lack of spatial awareness causes collisions, system recoveries, and workflow disruptions, issues that will intensify with the introduction of distributed robots with independent interacting arms. Existing tracking systems rely on bulky infrared cameras and reflective markers, providing only limited views of the surgical scene and adding hardware burden in crowded operating rooms. We present a marker-free proprioception method that enables precise localisation of surgical robots under their sterile draping despite associated obstruction of visual cues. Our method solely relies on lightweight stereo-RGB cameras and novel transformer-based deep learning models. It builds on the largest multi-centre spatial robotic surgery dataset to date (1.4M self-annotated images from human cadaveric and preclinical in vivo studies). By tracking the entire robot and surgical scene, rather than individual markers, our approach provides a holistic view robust to occlusions, supporting surgical scene understanding and context-aware control. We demonstrate an example of potential clinical benefits during in vivo breathing compensation with access to tissue dynamics, unobservable under state of the art tracking, and accurately locate in multi-robot systems for future intelligent interaction. In addition, and compared with existing systems, our method eliminates markers and improves tracking visibility by 25%. To our knowledge, this is the first demonstration of marker-free proprioception for fully draped surgical robots, reducing setup complexity, enhancing safety, and paving the way toward modular and autonomous robotic surgery.
CVMar 15, 2024
Motion-Boundary-Driven Unsupervised Surgical Instrument Segmentation in Low-Quality Optical FlowYang Liu, Peiran Wu, Jiayu Huo et al.
Unsupervised video-based surgical instrument segmentation has the potential to accelerate the adoption of robot-assisted procedures by reducing the reliance on manual annotations. However, the generally low quality of optical flow in endoscopic footage poses a great challenge for unsupervised methods that rely heavily on motion cues. To overcome this limitation, we propose a novel approach that pinpoints motion boundaries, regions with abrupt flow changes, while selectively discarding frames with globally low-quality flow and adapting to varying motion patterns. Experiments on the EndoVis2017 VOS and EndoVis2017 Challenge datasets show that our method achieves mean Intersection-over-Union (mIoU) scores of 0.75 and 0.72, respectively, effectively alleviating the constraints imposed by suboptimal optical flow. This enables a more scalable and robust surgical instrument segmentation solution in clinical settings. The code will be publicly released.
IVOct 21, 2021
2020 CATARACTS Semantic Segmentation ChallengeImanol Luengo, Maria Grammatikopoulou, Rahim Mohammadi et al.
Surgical scene segmentation is essential for anatomy and instrument localization which can be further used to assess tissue-instrument interactions during a surgical procedure. In 2017, the Challenge on Automatic Tool Annotation for cataRACT Surgery (CATARACTS) released 50 cataract surgery videos accompanied by instrument usage annotations. These annotations included frame-level instrument presence information. In 2020, we released pixel-wise semantic annotations for anatomy and instruments for 4670 images sampled from 25 videos of the CATARACTS training set. The 2020 CATARACTS Semantic Segmentation Challenge, which was a sub-challenge of the 2020 MICCAI Endoscopic Vision (EndoVis) Challenge, presented three sub-tasks to assess participating solutions on anatomical structure and instrument segmentation. Their performance was assessed on a hidden test set of 531 images from 10 videos of the CATARACTS test set.
IVSep 30, 2021
Deep Homography Estimation in Dynamic Surgical Scenes for Laparoscopic Camera Motion ExtractionMartin Huber, Sébastien Ourselin, Christos Bergeles et al.
Current laparoscopic camera motion automation relies on rule-based approaches or only focuses on surgical tools. Imitation Learning (IL) methods could alleviate these shortcomings, but have so far been applied to oversimplified setups. Instead of extracting actions from oversimplified setups, in this work we introduce a method that allows to extract a laparoscope holder's actions from videos of laparoscopic interventions. We synthetically add camera motion to a newly acquired dataset of camera motion free da Vinci surgery image sequences through a novel homography generation algorithm. The synthetic camera motion serves as a supervisory signal for camera motion estimation that is invariant to object and tool motion. We perform an extensive evaluation of state-of-the-art (SOTA) Deep Neural Networks (DNNs) across multiple compute regimes, finding our method transfers from our camera motion free da Vinci surgery dataset to videos of laparoscopic interventions, outperforming classical homography estimation approaches in both, precision by 41%, and runtime on a CPU by 43%.