66.8ROMay 28
Follow Everything: A Leader-Following and Obstacle Avoidance Framework with Goal-Aware AdaptationQianyi Zhang, Shijian Ma, Boyi Liu et al.
Robust and flexible leader-following is a critical capability for robots to integrate into human society. While existing methods struggle to generalize to leaders of arbitrary form and often fail when the leader temporarily leaves the robot's field of view, this work introduces a unified framework addressing both challenges. First, traditional detection models are replaced with a segmentation model, allowing the leader to be anything. To enhance recognition robustness, a distance frame buffer is implemented that stores leader embeddings at multiple distances, accounting for the unique characteristics of leader-following tasks. Second, a goal-aware adaptation mechanism is designed to govern robot planning states based on the leader's visibility and motion, complemented by a graph-based planner that generates candidate trajectories for each state, ensuring efficient following with obstacle avoidance. Simulations and real-world experiments with a legged robot follower and various leaders (human, ground robot, UAV, legged robot, stop sign) in both indoor and outdoor environments show competitive improvements in follow success rate, reduced visual loss duration, lower collision rate, and decreased leader-follower distance.
ROJul 30, 2022Code
Robust Contact State Estimation in Humanoid Walking GaitsStylianos Piperakis, Michael Maravgakis, Dimitrios Kanoulas et al.
In this article, we propose a deep learning framework that provides a unified approach to the problem of leg contact detection in humanoid robot walking gaits. Our formulation accomplishes to accurately and robustly estimate the contact state probability for each leg (i.e., stable or slip/no contact). The proposed framework employs solely proprioceptive sensing and although it relies on simulated ground-truth contact data for the classification process, we demonstrate that it generalizes across varying friction surfaces and different legged robotic platforms and, at the same time, is readily transferred from simulation to practice. The framework is quantitatively and qualitatively assessed in simulation via the use of ground-truth contact data and is contrasted against state of-the-art methods with an ATLAS, a NAO, and a TALOS humanoid robot. Furthermore, its efficacy is demonstrated in base estimation with a real TALOS humanoid. To reinforce further research endeavors, our implementation is offered as an open-source ROS/Python package, coined Legged Contact Detection (LCD).
LGJul 17, 2024Code
Analyzing the Generalization and Reliability of Steering VectorsDaniel Tan, David Chanin, Aengus Lynch et al.
Steering vectors (SVs) have been proposed as an effective approach to adjust language model behaviour at inference time by intervening on intermediate model activations. They have shown promise in terms of improving both capabilities and model alignment. However, the reliability and generalisation properties of this approach are unknown. In this work, we rigorously investigate these properties, and show that steering vectors have substantial limitations both in- and out-of-distribution. In-distribution, steerability is highly variable across different inputs. Depending on the concept, spurious biases can substantially contribute to how effective steering is for each input, presenting a challenge for the widespread use of steering vectors. Out-of-distribution, while steering vectors often generalise well, for several concepts they are brittle to reasonable changes in the prompt, resulting in them failing to generalise well. Overall, our findings show that while steering can work well in the right circumstances, there remain technical difficulties of applying steering vectors to guide models' behaviour at scale. Our code is available at https://github.com/dtch1997/steering-bench
CVSep 15, 2022
One-Shot Transfer of Affordance Regions? AffCorrs!Denis Hadjivelichkov, Sicelukwanda Zwane, Marc Peter Deisenroth et al.
In this work, we tackle one-shot visual search of object parts. Given a single reference image of an object with annotated affordance regions, we segment semantically corresponding parts within a target scene. We propose AffCorrs, an unsupervised model that combines the properties of pre-trained DINO-ViT's image descriptors and cyclic correspondences. We use AffCorrs to find corresponding affordances both for intra- and inter-class one-shot part segmentation. This task is more difficult than supervised alternatives, but enables future work such as learning affordances via imitation and assisted teleoperation.
LGJun 6, 2023
Value Functions are Control Barrier Functions: Verification of Safe Policies using Control TheoryDaniel C. H. Tan, Fernando Acero, Robert McCarthy et al.
Guaranteeing safe behaviour of reinforcement learning (RL) policies poses significant challenges for safety-critical applications, despite RL's generality and scalability. To address this, we propose a new approach to apply verification methods from control theory to learned value functions. By analyzing task structures for safety preservation, we formalize original theorems that establish links between value functions and control barrier functions. Further, we propose novel metrics for verifying value functions in safe control tasks and practical implementation details to improve learning. Our work presents a novel method for certificate learning, which unlocks a diversity of verification techniques from control theory for RL policies, and marks a significant step towards a formal framework for the general, scalable, and verifiable design of RL-based control systems. Code and videos are available at this https url: https://rl-cbf.github.io/
RONov 30, 2024Code
Real-Time Metric-Semantic Mapping for Autonomous Navigation in Outdoor EnvironmentsJianhao Jiao, Ruoyu Geng, Yuanhang Li et al.
The creation of a metric-semantic map, which encodes human-prior knowledge, represents a high-level abstraction of environments. However, constructing such a map poses challenges related to the fusion of multi-modal sensor data, the attainment of real-time mapping performance, and the preservation of structural and semantic information consistency. In this paper, we introduce an online metric-semantic mapping system that utilizes LiDAR-Visual-Inertial sensing to generate a global metric-semantic mesh map of large-scale outdoor environments. Leveraging GPU acceleration, our mapping process achieves exceptional speed, with frame processing taking less than 7ms, regardless of scenario scale. Furthermore, we seamlessly integrate the resultant map into a real-world navigation system, enabling metric-semantic-based terrain assessment and autonomous point-to-point navigation within a campus environment. Through extensive experiments conducted on both publicly available and self-collected datasets comprising 24 sequences, we demonstrate the effectiveness of our mapping and navigation methodologies. Code has been publicly released: https://github.com/gogojjh/cobra
ROJan 29, 2025Code
Watch Your STEPP: Semantic Traversability Estimation using Pose Projected FeaturesSebastian Ægidius, Dennis Hadjivelichkov, Jianhao Jiao et al.
Understanding the traversability of terrain is essential for autonomous robot navigation, particularly in unstructured environments such as natural landscapes. Although traditional methods, such as occupancy mapping, provide a basic framework, they often fail to account for the complex mobility capabilities of some platforms such as legged robots. In this work, we propose a method for estimating terrain traversability by learning from demonstrations of human walking. Our approach leverages dense, pixel-wise feature embeddings generated using the DINOv2 vision Transformer model, which are processed through an encoder-decoder MLP architecture to analyze terrain segments. The averaged feature vectors, extracted from the masked regions of interest, are used to train the model in a reconstruction-based framework. By minimizing reconstruction loss, the network distinguishes between familiar terrain with a low reconstruction error and unfamiliar or hazardous terrain with a higher reconstruction error. This approach facilitates the detection of anomalies, allowing a legged robot to navigate more effectively through challenging terrain. We run real-world experiments on the ANYmal legged robot both indoor and outdoor to prove our proposed method. The code is open-source, while video demonstrations can be found on our website: https://rpl-cs-ucl.github.io/STEPP
RODec 18, 2025
E-SDS: Environment-aware See it, Do it, Sorted - Automated Environment-Aware Reinforcement Learning for Humanoid LocomotionEnis Yalcin, Joshua O'Hara, Maria Stamatopoulou et al.
Vision-language models (VLMs) show promise in automating reward design in humanoid locomotion, which could eliminate the need for tedious manual engineering. However, current VLM-based methods are essentially "blind", as they lack the environmental perception required to navigate complex terrain. We present E-SDS (Environment-aware See it, Do it, Sorted), a framework that closes this perception gap. E-SDS integrates VLMs with real-time terrain sensor analysis to automatically generate reward functions that facilitate training of robust perceptive locomotion policies, grounded by example videos. Evaluated on a Unitree G1 humanoid across four distinct terrains (simple, gaps, obstacles, stairs), E-SDS uniquely enabled successful stair descent, while policies trained with manually-designed rewards or a non-perceptive automated baseline were unable to complete the task. In all terrains, E-SDS also reduced velocity tracking error by 51.9-82.6%. Our framework reduces the human effort of reward design from days to less than two hours while simultaneously producing more robust and capable locomotion policies.
49.3ROApr 29
Reactive Motion Generation via Phase-varying Neural Potential FunctionsAhmet Tekden, Dimitrios Kanoulas, Aude Billard et al.
Dynamical systems (DS) methods for Learning-from-Demonstration (LfD) provide stable, continuous policies from few demonstrations. First-order dynamical systems (DS) are effective for many point-to-point and periodic tasks, as long as a unique velocity is defined for each state. For tasks with intersections (e.g., drawing an "8"), extensions such as second-order dynamics or phase variables are often used. However, by incorporating velocity, second-order models become sensitive to disturbances near intersections, as velocity is used to disambiguate motion direction. Moreover, this disambiguation may fail when nearly identical position-velocity pairs correspond to different onward motions. In contrast, phase-based methods rely on open-loop time or phase variables, which limit their ability to recover after perturbations. We introduce Phase-varying Neural Potential Functions (PNPF), an LfD framework that conditions a potential function on a phase variable which is estimated directly from state progression, rather than on open-loop temporal inputs. This phase variable allows the system to handle state revisits, while the learned potential function generates local vector fields for reactive and stable control. PNPF generalizes effectively across point-to-point, periodic, and full 6D motion tasks, outperforms existing baselines on trajectories with intersections, and demonstrates robust performance in real-time robotic manipulation under external disturbances.
CVMar 18, 2024
DVN-SLAM: Dynamic Visual Neural SLAM Based on Local-Global EncodingWenhua Wu, Guangming Wang, Ting Deng et al.
Recent research on Simultaneous Localization and Mapping (SLAM) based on implicit representation has shown promising results in indoor environments. However, there are still some challenges: the limited scene representation capability of implicit encodings, the uncertainty in the rendering process from implicit representations, and the disruption of consistency by dynamic objects. To address these challenges, we propose a real-time dynamic visual SLAM system based on local-global fusion neural implicit representation, named DVN-SLAM. To improve the scene representation capability, we introduce a local-global fusion neural implicit representation that enables the construction of an implicit map while considering both global structure and local details. To tackle uncertainties arising from the rendering process, we design an information concentration loss for optimization, aiming to concentrate scene information on object surfaces. The proposed DVN-SLAM achieves competitive performance in localization and mapping across multiple datasets. More importantly, DVN-SLAM demonstrates robustness in dynamic scenes, a trait that sets it apart from other NeRF-based methods.
35.1ROApr 21
Quadruped Parkour Learning: Sparsely Gated Mixture of Experts with Visual InputMichael Ziegltrum, Jianhao Jiao, Tianhu Peng et al.
Robotic parkour provides a compelling benchmark for advancing locomotion over highly challenging terrain, including large discontinuities such as elevated steps. Recent approaches have demonstrated impressive capabilities, including dynamic climbing and jumping, but typically rely on sequential multilayer perceptron (MLP) architectures with densely activated layers. In contrast, sparsely gated mixture-of-experts (MoE) architectures have emerged in the large language model domain as an effective paradigm for improving scalability and performance by activating only a subset of parameters at inference time. In this work, we investigate the application of sparsely gated MoE architectures to vision-based robotic parkour. We compare control policies based on standard MLPs and MoE architectures under a controlled setting where the number of active parameters at inference time is matched. Experimental results on a real Unitree Go2 quadruped robot demonstrate clear performance gains, with the MoE policy achieving double the number of successful trials in traversing large obstacles compared to a standard MLP baseline. We further show that achieving comparable performance with a standard MLP requires scaling its parameter count to match that of the total MoE model, resulting in a 14.3\% increase in computation time. These results highlight that sparsely gated MoE architectures provide a favorable trade-off between performance and computational efficiency, enabling improved scaling of control policies for vision-based robotic parkour. An anonymized link to the codebase is https://osf.io/v2kqj/files/github?view_only=7977dee10c0a44769184498eaba72e44.
CVOct 15, 2024
LoGS: Visual Localization via Gaussian Splatting with Fewer Training ImagesYuzhou Cheng, Jianhao Jiao, Yue Wang et al.
Visual localization involves estimating a query image's 6-DoF (degrees of freedom) camera pose, which is a fundamental component in various computer vision and robotic tasks. This paper presents LoGS, a vision-based localization pipeline utilizing the 3D Gaussian Splatting (GS) technique as scene representation. This novel representation allows high-quality novel view synthesis. During the mapping phase, structure-from-motion (SfM) is applied first, followed by the generation of a GS map. During localization, the initial position is obtained through image retrieval, local feature matching coupled with a PnP solver, and then a high-precision pose is achieved through the analysis-by-synthesis manner on the GS map. Experimental results on four large-scale datasets demonstrate the proposed approach's SoTA accuracy in estimating camera poses and robustness under challenging few-shot conditions.
CVMar 27, 2024
AIR-HLoc: Adaptive Retrieved Images Selection for Efficient Visual LocalisationChangkun Liu, Jianhao Jiao, Huajian Huang et al.
State-of-the-art hierarchical localisation pipelines (HLoc) employ image retrieval (IR) to establish 2D-3D correspondences by selecting the top-$k$ most similar images from a reference database. While increasing $k$ improves localisation robustness, it also linearly increases computational cost and runtime, creating a significant bottleneck. This paper investigates the relationship between global and local descriptors, showing that greater similarity between the global descriptors of query and database images increases the proportion of feature matches. Low similarity queries significantly benefit from increasing $k$, while high similarity queries rapidly experience diminishing returns. Building on these observations, we propose an adaptive strategy that adjusts $k$ based on the similarity between the query's global descriptor and those in the database, effectively mitigating the feature-matching bottleneck. Our approach optimizes processing time without sacrificing accuracy. Experiments on three indoor and outdoor datasets show that AIR-HLoc reduces feature matching time by up to 30\%, while preserving state-of-the-art accuracy. The results demonstrate that AIR-HLoc facilitates a latency-sensitive localisation system.
ROApr 19, 2025
Unreal Robotics Lab: A High-Fidelity Robotics Simulator with Advanced Physics and RenderingJonathan Embley-Riches, Jianwei Liu, Simon Julier et al.
High-fidelity simulation is essential for robotics research, enabling safe and efficient testing of perception, control, and navigation algorithms. However, achieving both photorealistic rendering and accurate physics modeling remains a challenge. This paper presents a novel simulation framework--the Unreal Robotics Lab (URL) that integrates the Unreal Engine's advanced rendering capabilities with MuJoCo's high-precision physics simulation. Our approach enables realistic robotic perception while maintaining accurate physical interactions, facilitating benchmarking and dataset generation for vision-based robotics applications. The system supports complex environmental effects, such as smoke, fire, and water dynamics, which are critical for evaluating robotic performance under adverse conditions. We benchmark visual navigation and SLAM methods within our framework, demonstrating its utility for testing real-world robustness in controlled yet diverse scenarios. By bridging the gap between physics accuracy and photorealistic rendering, our framework provides a powerful tool for advancing robotics research and sim-to-real transfer.
CVOct 14, 2025
UniGS: Unified Geometry-Aware Gaussian Splatting for Multimodal RenderingYusen Xie, Zhenmin Huang, Jianhao Jiao et al.
In this paper, we propose UniGS, a unified map representation and differentiable framework for high-fidelity multimodal 3D reconstruction based on 3D Gaussian Splatting. Our framework integrates a CUDA-accelerated rasterization pipeline capable of rendering photo-realistic RGB images, geometrically accurate depth maps, consistent surface normals, and semantic logits simultaneously. We redesign the rasterization to render depth via differentiable ray-ellipsoid intersection rather than using Gaussian centers, enabling effective optimization of rotation and scale attribute through analytic depth gradients. Furthermore, we derive the analytic gradient formulation for surface normal rendering, ensuring geometric consistency among reconstructed 3D scenes. To improve computational and storage efficiency, we introduce a learnable attribute that enables differentiable pruning of Gaussians with minimal contribution during training. Quantitative and qualitative experiments demonstrate state-of-the-art reconstruction accuracy across all modalities, validating the efficacy of our geometry-aware paradigm. Source code and multimodal viewer will be available on GitHub.
ROJun 5, 2025
Learning to Recover: Dynamic Reward Shaping with Wheel-Leg Coordination for Fallen RobotsBoyuan Deng, Luca Rossini, Jin Wang et al.
Adaptive recovery from fall incidents are essential skills for the practical deployment of wheeled-legged robots, which uniquely combine the agility of legs with the speed of wheels for rapid recovery. However, traditional methods relying on preplanned recovery motions, simplified dynamics or sparse rewards often fail to produce robust recovery policies. This paper presents a learning-based framework integrating Episode-based Dynamic Reward Shaping and curriculum learning, which dynamically balances exploration of diverse recovery maneuvers with precise posture refinement. An asymmetric actor-critic architecture accelerates training by leveraging privileged information in simulation, while noise-injected observations enhance robustness against uncertainties. We further demonstrate that synergistic wheel-leg coordination reduces joint torque consumption by 15.8% and 26.2% and improves stabilization through energy transfer mechanisms. Extensive evaluations on two distinct quadruped platforms achieve recovery success rates up to 99.1% and 97.8% without platform-specific tuning. The supplementary material is available at https://boyuandeng.github.io/L2R-WheelLegCoordination/
ROMay 22, 2023
End-to-End Stable Imitation Learning via Autonomous Neural Dynamic PoliciesDionis Totsila, Konstantinos Chatzilygeroudis, Denis Hadjivelichkov et al.
State-of-the-art sensorimotor learning algorithms offer policies that can often produce unstable behaviors, damaging the robot and/or the environment. Traditional robot learning, on the contrary, relies on dynamical system-based policies that can be analyzed for stability/safety. Such policies, however, are neither flexible nor generic and usually work only with proprioceptive sensor states. In this work, we bridge the gap between generic neural network policies and dynamical system-based policies, and we introduce Autonomous Neural Dynamic Policies (ANDPs) that: (a) are based on autonomous dynamical systems, (b) always produce asymptotically stable behaviors, and (c) are more flexible than traditional stable dynamical system-based policies. ANDPs are fully differentiable, flexible generic-policies that can be used in imitation learning setups while ensuring asymptotic stability. In this paper, we explore the flexibility and capacity of ANDPs in several imitation learning tasks including experiments with image observations. The results show that ANDPs combine the benefits of both neural network-based and dynamical system-based methods.
ROOct 5, 2021
Fully Self-Supervised Class Awareness in Dense Object DescriptorsDenis Hadjivelichkov, Dimitrios Kanoulas
We address the problem of inferring self-supervised dense semantic correspondences between objects in multi-object scenes. The method introduces learning of class-aware dense object descriptors by providing either unsupervised discrete labels or confidence in object similarities. We quantitatively and qualitatively show that the introduced method outperforms previous techniques with more robust pixel-to-pixel matches. An example robotic application is also shown~- grasping of objects in clutter based on corresponding points.
ROOct 5, 2021
Improved Reinforcement Learning Coordinated Control of a Mobile Manipulator using Joint ClampingDenis Hadjivelichkov, Kostas Vlachos, Dimitrios Kanoulas
Many robotic path planning problems are continuous, stochastic, and high-dimensional. The ability of a mobile manipulator to coordinate its base and manipulator in order to control its whole-body online is particularly challenging when self and environment collision avoidance is required. Reinforcement Learning techniques have the potential to solve such problems through their ability to generalise over environments. We study joint penalties and joint limits of a state-of-the-art mobile manipulator whole-body controller that uses LIDAR sensing for obstacle collision avoidance. We propose directions to improve the reinforcement learning method. Our agent achieves significantly higher success rates than the baseline in a goal-reaching environment and it can solve environments that require coordinated whole-body control which the baseline fails.
ROApr 5, 2020
Curved patch mapping and tracking for irregular terrain modeling: Application to bipedal robot foot placementDimitrios Kanoulas, Nikos G. Tsagarakis, Marsette Vona
Legged robots need to make contact with irregular surfaces, when operating in unstructured natural terrains. Representing and perceiving these areas to reason about potential contact between a robot and its surrounding environment, is still largely an open problem. This paper introduces a new framework to model and map local rough terrain surfaces, for tasks such as bipedal robot foot placement. The system operates in real-time, on data from an RGB-D and an IMU sensor. We introduce a set of parametrized patch models and an algorithm to fit them in the environment. Potential contacts are identified as bounded curved patches of approximately the same size as the robot's foot sole. This includes sparse seed point sampling, point cloud neighborhood search, and patch fitting and validation. We also present a mapping and tracking system, where patches are maintained in a local spatial map around the robot as it moves. A bio-inspired sampling algorithm is introduced for finding salient contacts. We include a dense volumetric fusion layer for spatiotemporally tracking, using multiple depth data to reconstruct a local point cloud. We present experimental results on a mini-biped robot that performs foot placements on rocks, implementing a 3D foothold perception system, that uses the developed patch mapping and tracking framework.
ROFeb 18, 2018
Center-of-Mass-Based Grasp Pose Adaptation Using 3D Range and Force/Torque SensingDimitrios Kanoulas, Jinoh Lee, Darwin G. Caldwell et al.
Lifting objects, whose mass may produce high wrist torques that exceed the hardware strength limits, could lead to unstable grasps or serious robot damage. This work introduces a new Center-of-Mass (CoM)-based grasp pose adaptation method, for picking up objects using a combination of exteroceptive 3D perception and proprioceptive force/torque sensor feedback. The method works in two iterative stages to provide reliable and wrist torque efficient grasps. Initially, a geometric object CoM is estimated from the input range data. In the first stage, a set of hand-size handle grasps are localized on the object and the closest to its CoM is selected for grasping. In the second stage, the object is lifted using a single arm, while the force and torque readings from the sensor on the wrist are monitored. Based on these readings, a displacement to the new CoM estimation is calculated. The object is released and the process is repeated until the wrist torque effort is minimized. The advantage of our method is the blending of both exteroceptive (3D range) and proprioceptive (force/torque) sensing for finding the grasp location that minimizes the wrist effort, potentially improving the reliability of the grasping and the subsequent manipulation task. We experimentally validate the proposed method by executing a number of tests on a set of objects that include handles, using the humanoid robot WALK-MAN.
ROOct 1, 2017
Translating Videos to Commands for Robotic Manipulation with Deep Recurrent Neural NetworksAnh Nguyen, Dimitrios Kanoulas, Luca Muratore et al.
We present a new method to translate videos to commands for robotic manipulation using Deep Recurrent Neural Networks (RNN). Our framework first extracts deep features from the input video frames with a deep Convolutional Neural Networks (CNN). Two RNN layers with an encoder-decoder architecture are then used to encode the visual features and sequentially generate the output words as the command. We demonstrate that the translation accuracy can be improved by allowing a smooth transaction between two RNN layers and using the state-of-the-art feature extractor. The experimental results on our new challenging dataset show that our approach outperforms recent methods by a fair margin. Furthermore, we combine the proposed translation module with the vision and planning system to let a robot perform various manipulation tasks. Finally, we demonstrate the effectiveness of our framework on a full-size humanoid robot WALK-MAN.
RODec 19, 2016
Curved Surface Patches for Rough Terrain PerceptionDimitrios Kanoulas
Attaining animal-like legged locomotion on rough outdoor terrain with sparse foothold affordances -a primary use-case for legs vs other forms of locomotion- is a largely open problem. New advancements in control and perception have enabled bipeds to walk on flat and uneven indoor environments. But tasks that require reliable contact with unstructured world surfaces, for example walking on natural rocky terrain, need new perception and control algorithms. This thesis introduces 3D perception algorithms for contact tasks such as foot placement in rough terrain environments. We introduce a new method to identify and model potential contact areas between the robot's foot and a surface using a set of bounded curved patches. We present a patch parameterization model and an algorithm to fit and perceptually validate patches to 3D point samples. Having defined the environment representation using the patch model, we introduce a way to assemble patches into a spatial map. This map represents a sparse set of local areas potentially appropriate for contact between the robot and the surface. The process of creating such a map includes sparse seed point sampling, neighborhood searching, as well as patch fitting and validation. Various ways of sampling are introduced including a real time bio-inspired system for finding patches statistically similar to those that humans select while traversing rocky trails. These sparse patch algorithms are integrated with a dense volumetric fusion of range data from a moving depth camera, maintaining a dynamic patch map of relevant contact surfaces around a robot in real time. We integrate and test the algorithms as part of a real-time foothold perception system on a mini-biped robot, performing foot placements on rocks.