ROAug 4, 2022Code
LATTE: LAnguage Trajectory TransformErArthur Bucker, Luis Figueredo, Sami Haddadin et al.
Natural language is one of the most intuitive ways to express human intent. However, translating instructions and commands towards robotic motion generation and deployment in the real world is far from being an easy task. The challenge of combining a robot's inherent low-level geometric and kinodynamic constraints with a human's high-level semantic instructions traditionally is solved using task-specific solutions with little generalizability between hardware platforms, often with the use of static sets of target actions and commands. This work instead proposes a flexible language-based framework that allows a user to modify generic robotic trajectories. Our method leverages pre-trained language models (BERT and CLIP) to encode the user's intent and target objects directly from a free-form text input and scene images, fuses geometrical features generated by a transformer encoder network, and finally outputs trajectories using a transformer decoder, without the need of priors related to the task or robot information. We significantly extend our own previous work presented in Bucker et al. by expanding the trajectory parametrization space to 3D and velocity as opposed to just XY movements. In addition, we now train the model to use actual images of the objects in the scene for context (as opposed to textual descriptions), and we evaluate the system in a diverse set of scenarios beyond manipulation, such as aerial and legged robots. Our simulated and real-life experiments demonstrate that our transformer model can successfully follow human intent, modifying the shape and speed of trajectories within multiple environments. Codebase available at: https://github.com/arthurfenderbucker/LaTTe-Language-Trajectory-TransformEr.git
ROMay 27
RCM Constraint-Consistent Dynamic Control in Surgical RobotsYu Li, Hamid Sadeghian, Zewen Yang et al.
Robotic-assisted minimally invasive surgery (RAMIS) requires accurate enforcement of the remote center of motion (RCM) constraint to ensure safe tool motion through a trocar. Existing virtual RCM controllers are commonly formulated either at the kinematic level or as task-space objectives, which makes torque-level enforcement under trocar motion and physical interaction difficult to formulate consistently. This paper models the RCM as a rheonomic holonomic constraint and incorporates it into a projection-based inverse-dynamics controller with explicit constrained/free-motion torque decomposition. The resulting formulation unifies kinematic RCM enforcement and task-space tracking at the torque level, while preserving a constraint-consistent structure for residual regulation and null-space compliance. The proposed controller is validated in simulation and on a RAMIS training platform against representative projection-based and constrained-dynamics baselines. Across spiral tracking, varying insertion depth, moving trocar conditions, and human interaction, the method achieves lower RCM residuals and smoother torque profiles while maintaining accurate tool-tip tracking. These results support the use of constraint-consistent torque control for reliable virtual RCM enforcement in surgical robotics. The project page is available at https://rcmpc-cube.github.io
ROMar 25, 2022
Reshaping Robot Trajectories Using Natural Language Commands: A Study of Multi-Modal Data Alignment Using TransformersArthur Bucker, Luis Figueredo, Sami Haddadin et al.
Natural language is the most intuitive medium for us to interact with other people when expressing commands and instructions. However, using language is seldom an easy task when humans need to express their intent towards robots, since most of the current language interfaces require rigid templates with a static set of action targets and commands. In this work, we provide a flexible language-based interface for human-robot collaboration, which allows a user to reshape existing trajectories for an autonomous agent. We take advantage of recent advancements in the field of large language models (BERT and CLIP) to encode the user command, and then combine these features with trajectory information using multi-modal attention transformers. We train the model using imitation learning over a dataset containing robot trajectories modified by language commands, and treat the trajectory generation process as a sequence prediction problem, analogously to how language generation architectures operate. We evaluate the system in multiple simulated trajectory scenarios, and show a significant performance increase of our model over baseline approaches. In addition, our real-world experiments with a robot arm show that users significantly prefer our natural language interface over traditional methods such as kinesthetic teaching or cost-function programming. Our study shows how the field of robotics can take advantage of large pre-trained language models towards creating more intuitive interfaces between robots and machines. Project webpage: https://arthurfenderbucker.github.io/NL_trajectory_reshaper/
ROOct 9, 2023
Care3D: An Active 3D Object Detection Dataset of Real Robotic-Care EnvironmentsMichael G. Adam, Sebastian Eger, Martin Piccolrovazzi et al.
As labor shortage increases in the health sector, the demand for assistive robotics grows. However, the needed test data to develop those robots is scarce, especially for the application of active 3D object detection, where no real data exists at all. This short paper counters this by introducing such an annotated dataset of real environments. The captured environments represent areas which are already in use in the field of robotic health care research. We further provide ground truth data within one room, for assessing SLAM algorithms running directly on a health care robot.
ROOct 18, 2023
LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop ManipulationShengqiang Zhang, Philipp Wicke, Lütfi Kerem Şenel et al.
The convergence of embodied agents and large language models (LLMs) has brought significant advancements to embodied instruction following. Particularly, the strong reasoning capabilities of LLMs make it possible for robots to perform long-horizon tasks without expensive annotated demonstrations. However, public benchmarks for testing the long-horizon reasoning capabilities of language-conditioned robots in various scenarios are still missing. To fill this gap, this work focuses on the tabletop manipulation task and releases a simulation benchmark, \textit{LoHoRavens}, which covers various long-horizon reasoning aspects spanning color, size, space, arithmetics and reference. Furthermore, there is a key modality bridging problem for long-horizon manipulation tasks with LLMs: how to incorporate the observation feedback during robot execution for the LLM's closed-loop planning, which is however less studied by prior work. We investigate two methods of bridging the modality gap: caption generation and learnable interface for incorporating explicit and implicit observation feedback to the LLM, respectively. These methods serve as the two baselines for our proposed benchmark. Experiments show that both methods struggle to solve some tasks, indicating long-horizon manipulation tasks are still challenging for current popular models. We expect the proposed public benchmark and baselines can help the community develop better models for long-horizon tabletop manipulation tasks.
RONov 4, 2023
Anthropomorphic Grasping with Neural Object Shape CompletionDiego Hidalgo-Carvajal, Hanzhi Chen, Gemma C. Bettelani et al.
The progressive prevalence of robots in human-suited environments has given rise to a myriad of object manipulation techniques, in which dexterity plays a paramount role. It is well-established that humans exhibit extraordinary dexterity when handling objects. Such dexterity seems to derive from a robust understanding of object properties (such as weight, size, and shape), as well as a remarkable capacity to interact with them. Hand postures commonly demonstrate the influence of specific regions on objects that need to be grasped, especially when objects are partially visible. In this work, we leverage human-like object understanding by reconstructing and completing their full geometry from partial observations, and manipulating them using a 7-DoF anthropomorphic robot hand. Our approach has significantly improved the grasping success rates of baselines with only partial reconstruction by nearly 30% and achieved over 150 successful grasps with three different object categories. This demonstrates our approach's consistent ability to predict and execute grasping postures based on the completed object shapes from various directions and positions in real-world scenarios. Our work opens up new possibilities for enhancing robotic applications that require precise grasping and manipulation skills of real-world reconstructed objects.
ROSep 18, 2024
LEMMo-Plan: LLM-Enhanced Learning from Multi-Modal Demonstration for Planning Sequential Contact-Rich Manipulation TasksKejia Chen, Zheng Shen, Yue Zhang et al.
Large Language Models (LLMs) have gained popularity in task planning for long-horizon manipulation tasks. To enhance the validity of LLM-generated plans, visual demonstrations and online videos have been widely employed to guide the planning process. However, for manipulation tasks involving subtle movements but rich contact interactions, visual perception alone may be insufficient for the LLM to fully interpret the demonstration. Additionally, visual data provides limited information on force-related parameters and conditions, which are crucial for effective execution on real robots. In this paper, we introduce an in-context learning framework that incorporates tactile and force-torque information from human demonstrations to enhance LLMs' ability to generate plans for new task scenarios. We propose a bootstrapped reasoning pipeline that sequentially integrates each modality into a comprehensive task plan. This task plan is then used as a reference for planning in new task configurations. Real-world experiments on two different sequential manipulation tasks demonstrate the effectiveness of our framework in improving LLMs' understanding of multi-modal demonstrations and enhancing the overall planning performance.
ROFeb 2Code
Bridging the Sim-to-Real Gap with multipanda ros2: A Real-Time ROS2 Framework for Multimanual SystemsJon Škerlj, Seongjin Bien, Abdeldjallil Naceri et al.
We present $multipanda\_ros2$, a novel open-source ROS2 architecture for multi-robot control of Franka Robotics robots. Leveraging ros2 control, this framework provides native ROS2 interfaces for controlling any number of robots from a single process. Our core contributions address key challenges in real-time torque control, including interaction control and robot-environment modeling. A central focus of this work is sustaining a 1kHz control frequency, a necessity for real-time control and a minimum frequency required by safety standards. Moreover, we introduce a controllet-feature design pattern that enables controller-switching delays of $\le 2$ ms, facilitating reproducible benchmarking and complex multi-robot interaction scenarios. To bridge the simulation-to-reality (sim2real) gap, we integrate a high-fidelity MuJoCo simulation with quantitative metrics for both kinematic accuracy and dynamic consistency (torques, forces, and control errors). Furthermore, we demonstrate that real-world inertial parameter identification can significantly improve force and torque accuracy, providing a methodology for iterative physics refinement. Our work extends approaches from soft robotics to rigid dual-arm, contact-rich tasks, showcasing a promising method to reduce the sim2real gap and providing a robust, reproducible platform for advanced robotics research.
ROJan 24, 2022Code
Towards Remote Robotic Competitions: An Internet-Connected Task Board and DashboardPeter So, Jonas Wittmann, Patrick Ruhkamp et al.
In this work we present a platform to assess robot platform skills using an internet-of-things (IoT) task board device to aggregate performances across remote sites. We demonstrate a concept for a modular, scale-able device and web dashboard enabling remote competitions as an alternative to in-person robot competitions. We share data from nine robot platforms located across four continents in three manipulation task categories of object localization, object insertion, and component disassembly through an organized international robot competition - the Robothon Grand Challenge. This paper discusses the design of an electronic task board, the strategies implemented by the top-performing teams and compares their results with a benchmark solution to the presented task board. Through this platform, we demonstrate fully remote, online competitions can generate innovative robotic solutions and tested a tool for measuring remote performances. Using the open-sourced task board code and design files, the reader can reproduce the benchmark solution or configure the platform for their own use case and share their results transparently without transporting their robot platform.
ROSep 17, 2025
Prompt2Auto: From Motion Prompt to Automated Control via Geometry-Invariant One-Shot Gaussian Process LearningZewen Yang, Xiaobing Dai, Dongfa Zhang et al.
Learning from demonstration allows robots to acquire complex skills from human demonstrations, but conventional approaches often require large datasets and fail to generalize across coordinate transformations. In this paper, we propose Prompt2Auto, a geometry-invariant one-shot Gaussian process (GeoGP) learning framework that enables robots to perform human-guided automated control from a single motion prompt. A dataset-construction strategy based on coordinate transformations is introduced that enforces invariance to translation, rotation, and scaling, while supporting multi-step predictions. Moreover, GeoGP is robust to variations in the user's motion prompt and supports multi-skill autonomy. We validate the proposed approach through numerical simulations with the designed user graphical interface and two real-world robotic experiments, which demonstrate that the proposed method is effective, generalizes across tasks, and significantly reduces the demonstration burden. Project page is available at: https://prompt2auto.github.io
LGAug 5, 2025
Streaming Generated Gaussian Process Experts for Online Learning and Control: Extended VersionZewen Yang, Dongfa Zhang, Xiaobing Dai et al.
Gaussian Processes (GPs), as a nonparametric learning method, offer flexible modeling capabilities and calibrated uncertainty quantification for function approximations. Additionally, GPs support online learning by efficiently incorporating new data with polynomial-time computation, making them well-suited for safety-critical dynamical systems that require rapid adaptation. However, the inference and online updates of exact GPs, when processing streaming data, incur cubic computation time and quadratic storage memory complexity, limiting their scalability to large datasets in real-time settings. In this paper, we propose a streaming kernel-induced progressively generated expert framework of Gaussian processes (SkyGP) that addresses both computational and memory constraints by maintaining a bounded set of experts, while inheriting the learning performance guarantees from exact Gaussian processes. Furthermore, two SkyGP variants are introduced, each tailored to a specific objective, either maximizing prediction accuracy (SkyGP-Dense) or improving computational efficiency (SkyGP-Fast). The effectiveness of SkyGP is validated through extensive benchmarks and real-time control experiments demonstrating its superior performance compared to state-of-the-art approaches.
ROMar 19, 2025
Geometrically-Aware One-Shot Skill Transfer of Category-Level ObjectsCristiana de Farias, Luis Figueredo, Riddhiman Laha et al.
Robotic manipulation of unfamiliar objects in new environments is challenging and requires extensive training or laborious pre-programming. We propose a new skill transfer framework, which enables a robot to transfer complex object manipulation skills and constraints from a single human demonstration. Our approach addresses the challenge of skill acquisition and task execution by deriving geometric representations from demonstrations focusing on object-centric interactions. By leveraging the Functional Maps (FM) framework, we efficiently map interaction functions between objects and their environments, allowing the robot to replicate task operations across objects of similar topologies or categories, even when they have significantly different shapes. Additionally, our method incorporates a Task-Space Imitation Algorithm (TSIA) which generates smooth, geometrically-aware robot paths to ensure the transferred skills adhere to the demonstrated task constraints. We validate the effectiveness and adaptability of our approach through extensive experiments, demonstrating successful skill transfer and task execution in diverse real-world environments without requiring additional training.
ROOct 19, 2021
Learning Robotic Manipulation Skills Using an Adaptive Force-Impedance Action SpaceMaximilian Ulmer, Elie Aljalbout, Sascha Schwarz et al.
Intelligent agents must be able to think fast and slow to perform elaborate manipulation tasks. Reinforcement Learning (RL) has led to many promising results on a range of challenging decision-making tasks. However, in real-world robotics, these methods still struggle, as they require large amounts of expensive interactions and have slow feedback loops. On the other hand, fast human-like adaptive control methods can optimize complex robotic interactions, yet fail to integrate multimodal feedback needed for unstructured tasks. In this work, we propose to factor the learning problem in a hierarchical learning and adaption architecture to get the best of both worlds. The framework consists of two components, a slow reinforcement learning policy optimizing the task strategy given multimodal observations, and a fast, real-time adaptive control policy continuously optimizing the motion, stability, and effort of the manipulator. We combine these components through a bio-inspired action space that we call AFORCE. We demonstrate the new action space on a contact-rich manipulation task on real hardware and evaluate its performance on three simulated manipulation tasks. Our experiments show that AFORCE drastically improves sample efficiency while reducing energy consumption and improving safety.
ROSep 15, 2021
Expectable Motion Unit: Avoiding Hazards From Human Involuntary Motions in Human-Robot InteractionRobin Jeanne Kirschner, Henning Mayer, Lisa Burr et al.
In robotics, many control and planning schemes have been developed to ensure human physical safety in human-robot interaction. The human psychological state and the expectation towards the robot, however, are typically neglected. Even if the robot behaviour is regarded as biomechanically safe, humans may still react with a rapid involuntary motion (IM) caused by a startle or surprise. Such sudden, uncontrolled motions can jeopardize safety and should be prevented by any means. In this letter, we propose the Expectable Motion Unit (EMU), which ensures that a certain probability of IM occurrence is not exceeded in a typical HRI setting. Based on a model of IM occurrence generated through an experiment with 29 participants, we establish the mapping between robot velocity, robot-human distance, and the relative frequency of IM occurrence. This mapping is processed towards a real-time capable robot motion generator that limits the robot velocity during task execution if necessary. The EMU is combined in a holistic safety framework that integrates both the physical and psychological safety knowledge. A validation experiment showed that the EMU successfully avoids human IM in five out of six cases.
LGMay 30, 2021
SyReNets: Symbolic Residual Neural NetworksCarlos Magno C. O. Valle, Sami Haddadin
Despite successful seminal works on passive systems in the literature, learning free-form physical laws for controlled dynamical systems given experimental data is still an open problem. For decades, symbolic mathematical equations and system identification were the golden standards. Unfortunately, a set of assumptions about the properties of the underlying system is required, which makes the model very rigid and unable to adapt to unforeseen changes in the physical system. Neural networks, on the other hand, are known universal function approximators but are prone to over-fit, limited accuracy, and bias problems, which makes them alone unreliable candidates for such tasks. In this paper, we propose SyReNets, an approach that leverages neural networks for learning symbolic relations to accurately describe dynamic physical systems from data. It explores a sequence of symbolic layers that build, in a residual manner, mathematical relations that describes a given desired output from input variables. We apply it to learn the symbolic equation that describes the Lagrangian of a given physical system. We do this by only observing random samples of position, velocity, and acceleration as input and torque as output. Therefore, using the Lagrangian as a latent representation from which we derive torque using the Euler-Lagrange equations. The approach is evaluated using a simulated controlled double pendulum and compared with neural networks, genetic programming, and traditional system identification. The results demonstrate that, compared to neural networks and genetic programming, SyReNets converges to representations that are more accurate and precise throughout the state space. Despite having slower convergence than traditional system identification, similar to neural networks, the approach remains flexible enough to adapt to an unforeseen change in the physical system structure.
ROOct 30, 2020
Learning Vision-based Reactive Policies for Obstacle AvoidanceElie Aljalbout, Ji Chen, Konstantin Ritt et al.
In this paper, we address the problem of vision-based obstacle avoidance for robotic manipulators. This topic poses challenges for both perception and motion generation. While most work in the field aims at improving one of those aspects, we provide a unified framework for approaching this problem. The main goal of this framework is to connect perception and motion by identifying the relationship between the visual input and the corresponding motion representation. To this end, we propose a method for learning reactive obstacle avoidance policies. We evaluate our method on goal-reaching tasks for single and multiple obstacles scenarios. We show the ability of the proposed method to efficiently learn stable obstacle avoidance strategies at a high success rate, while maintaining closed-loop responsiveness required for critical applications like human-robot interaction.
LGOct 25, 2020
How to Make Deep RL Work in PracticeNirnai Rao, Elie Aljalbout, Axel Sauer et al.
In recent years, challenging control problems became solvable with deep reinforcement learning (RL). To be able to use RL for large-scale real-world applications, a certain degree of reliability in their performance is necessary. Reported results of state-of-the-art algorithms are often difficult to reproduce. One reason for this is that certain implementation details influence the performance significantly. Commonly, these details are not highlighted as important techniques to achieve state-of-the-art performance. Additionally, techniques from supervised learning are often used by default but influence the algorithms in a reinforcement learning setting in different and not well-understood ways. In this paper, we investigate the influence of certain initialization, input normalization, and adaptive learning techniques on the performance of state-of-the-art RL algorithms. We make suggestions which of those techniques to use by default and highlight areas that could benefit from a solution specifically tailored to RL.
CVJul 21, 2019
Tracking Holistic Object RepresentationsAxel Sauer, Elie Aljalbout, Sami Haddadin
Recent advances in visual tracking are based on siamese feature extractors and template matching. For this category of trackers, latest research focuses on better feature embeddings and similarity measures. In this work, we focus on building holistic object representations for tracking. We propose a framework that is designed to be used on top of previous trackers without any need for further training of the siamese network. The framework leverages the idea of obtaining additional object templates during the tracking process. Since the number of stored templates is limited, our method only keeps the most diverse ones. We achieve this by providing a new diversity measure in the space of siamese features. The obtained representation contains information beyond the ground truth object location provided to the system. It is then useful for tracking itself but also for further tasks which require a visual understanding of objects. Strong empirical results on tracking benchmarks indicate that our method can improve the performance and robustness of the underlying trackers while barely reducing their speed. In addition, our method is able to match current state-of-the-art results, while using a simpler and older network architecture and running three times faster.
ROOct 30, 2018
Simultaneous Contact and Aerodynamic Force Estimation (s-CAFE) for Aerial RobotsTeodor Tomić, Philipp Lutz, Korbinian Schmid et al.
In this paper, we consider the problem of multirotor flying robots physically interacting with the environment under wind influence. The result are the first algorithms for simultaneous online estimation of contact and aerodynamic wrenches acting on the robot based on real-world data, without the need for dedicated sensors. For this purpose, we investigate two model-based techniques for discriminating between aerodynamic and interaction forces. The first technique is based on aerodynamic and contact torque models, and uses the external force to estimate wind speed. Contacts are then detected based on the residual between estimated external torque and expected (modeled) aerodynamic torque. Upon detecting contact, wind speed is assumed to change very slowly. From the estimated interaction wrench, we are also able to determine the contact location. This is embedded into a particle filter framework to further improve contact location estimation. The second algorithm uses the propeller aerodynamic power and angular speed as measured by the speed controllers to obtain an estimate of the airspeed. An aerodynamics model is then used to determine the aerodynamic wrench. Both methods rely on accurate aerodynamics models. Therefore, we evaluate data-driven and physics based models as well as offline system identification for flying robots. For obtaining ground truth data we performed autonomous flights in a 3D wind tunnel. Using this data, aerodynamic model selection, parameter identification, and discrimination between aerodynamic and contact forces could be done. Finally, the developed methods could serve as useful estimators for interaction control schemes with simultaneous compensation of wind disturbances.
ROMay 22, 2018
A Framework for Robot Manipulation: Skill Formalism, Meta Learning and Adaptive ControlLars Johannsmeier, Malkin Gerchow, Sami Haddadin
In this paper we introduce a novel framework for expressing and learning force-sensitive robot manipulation skills. It is based on a formalism that extends our previous work on adaptive impedance control with meta parameter learning and compatible skill specifications. This way the system is also able to make use of abstract expert knowledge by incorporating process descriptions and quality evaluation metrics. We evaluate various state-of-the-art schemes for the meta parameter learning and experimentally compare selected ones. Our results clearly indicate that the combination of our adaptive impedance controller with a carefully defined skill formalism significantly reduces the complexity of manipulation tasks even for learning peg-in-hole with submillimeter industrial tolerances. Overall, the considered system is able to learn variations of this skill in under 20 minutes. In fact, experimentally the system was able to perform the learned tasks faster than humans, leading to the first learning-based solution of complex assembly at such real-world performance.