RO | Feb 18
Articulated 3D Scene Graphs for Open-World Mobile Manipulation
Martin Büchner, Adrian Röfer, Tim Engelbracht et al.
Semantics has enabled 3D scene understanding and affordance-driven object interaction. However, robots operating in real-world environments face a critical limitation: they cannot anticipate how objects move. Long-horizon mobile manipulation requires closing the gap between semantics, geometry, and kinematics. In this work, we present MoMa-SG, a novel framework for building semantic-kinematic 3D scene graphs of articulated scenes containing a myriad of interactable objects. Given RGB-D sequences containing multiple object articulations, we temporally segment object interactions and infer object motion using occlusion-robust point tracking. We then lift point trajectories into 3D and estimate articulation models using a novel unified twist estimation formulation that robustly estimates revolute and prismatic joint parameters in a single optimization pass. Next, we associate objects with estimated articulations and detect contained objects by reasoning over parent-child relations at identified opening states. We also introduce the novel Arti4D-Semantic dataset, which uniquely combines hierarchical object semantics including parent-child relation labels with object axis annotations across 62 in-the-wild RGB-D sequences containing 600 object interactions and three distinct observation paradigms. We extensively evaluate the performance of MoMa-SG on two datasets and ablate key design choices of our approach. In addition, real-world experiments on both a quadruped and a mobile manipulator demonstrate that our semantic-kinematic scene graphs enable robust manipulation of articulated objects in everyday home environments. We provide code and data at: https://momasg.cs.uni-freiburg.de.
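As a rough illustration of why one motion parameterization can cover both joint types, the sketch below classifies a single relative rigid motion (rotation matrix `R`, translation `t`) as revolute or prismatic from its rotation angle. This is not the MoMa-SG estimator, which optimizes over tracked point trajectories; the function name `classify_joint` and the threshold `angle_eps` are assumptions made for this example.

```python
import math


def classify_joint(R, t, angle_eps=1e-3):
    """Classify one relative rigid motion as 'revolute' or 'prismatic'.

    R is a 3x3 rotation matrix given as nested lists, t is a 3-vector.
    Returns (joint_type, unit_axis, magnitude), where magnitude is the
    rotation angle in radians or the translation distance.
    angle_eps is a hypothetical threshold chosen for this sketch.
    """
    trace = R[0][0] + R[1][1] + R[2][2]
    # Clamp to [-1, 1] so rounding errors cannot break acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    angle = math.acos(cos_angle)

    if angle < angle_eps:
        # Negligible rotation: treat the motion as a pure translation.
        norm = math.sqrt(sum(c * c for c in t))
        if norm == 0.0:
            raise ValueError("no motion observed")
        return "prismatic", [c / norm for c in t], norm

    # Axis from the skew-symmetric part of R.
    # This is ill-conditioned when the angle approaches pi.
    s = 2.0 * math.sin(angle)
    axis = [
        (R[2][1] - R[1][2]) / s,
        (R[0][2] - R[2][0]) / s,
        (R[1][0] - R[0][1]) / s,
    ]
    return "revolute", axis, angle
```

A real pipeline would fit the model to many noisy observations rather than to one transform; the point here is only that the angular part of the motion decides which joint model applies.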
RO | Sep 1, 2025
Articulated Object Estimation in the Wild
Abdelrhman Werby, Martin Büchner, Adrian Röfer et al.
Understanding the 3D motion of articulated objects is essential in robotic scene understanding, mobile manipulation, and motion planning. Prior methods for articulation estimation have primarily focused on controlled settings, assuming either fixed camera viewpoints or direct observations of various object states, which tend to fail in more realistic unconstrained environments. In contrast, humans effortlessly infer articulation by watching others manipulate objects. Inspired by this, we introduce ArtiPoint, a novel estimation framework that can infer articulated object models under dynamic camera motion and partial observability. By combining deep point tracking with a factor graph optimization framework, ArtiPoint robustly estimates articulated part trajectories and articulation axes directly from raw RGB-D videos. To foster future research in this domain, we introduce Arti4D, the first ego-centric in-the-wild dataset that captures articulated object interactions at a scene level, accompanied by articulation labels and ground-truth camera poses. We benchmark ArtiPoint against a range of classical and learning-based baselines, demonstrating its superior performance on Arti4D. We make code and Arti4D publicly available at https://artipoint.cs.uni-freiburg.de.
RO | May 6, 2025
The Unreasonable Effectiveness of Discrete-Time Gaussian Process Mixtures for Robot Policy Learning
Jan Ole von Hartz, Adrian Röfer, Joschka Boedecker et al.
We present Mixture of Discrete-time Gaussian Processes (MiDiGap), a novel approach for flexible policy representation and imitation learning in robot manipulation. MiDiGap enables learning from as few as five demonstrations using only camera observations and generalizes across a wide range of challenging tasks. It excels at long-horizon behaviors such as making coffee, highly constrained motions such as opening doors, dynamic actions such as scooping with a spatula, and multimodal tasks such as hanging a mug. MiDiGap learns these tasks on a CPU in less than a minute and scales linearly to large datasets. We also develop a rich suite of tools for inference-time steering using evidence such as collision signals and robot kinematic constraints. This steering enables novel generalization capabilities, including obstacle avoidance and cross-embodiment policy transfer. MiDiGap achieves state-of-the-art performance on diverse few-shot manipulation benchmarks. On constrained RLBench tasks, it improves policy success by 76 percentage points and reduces trajectory cost by 67%. On multimodal tasks, it improves policy success by 48 percentage points and increases sample efficiency by a factor of 20. In cross-embodiment transfer, it more than doubles policy success. We make the code publicly available at https://midigap.cs.uni-freiburg.de.
RO | May 2, 2024
Imagine2touch: Predictive Tactile Sensing for Robotic Manipulation using Efficient Low-Dimensional Signals
Abdallah Ayad, Adrian Röfer, Nick Heppert et al.
Humans seemingly incorporate potential touch signals in their perception. Our goal is to equip robots with a similar capability, which we term Imagine2touch. Imagine2touch aims to predict the expected touch signal based on a visual patch representing the area to be touched. We use ReSkin, an inexpensive and compact touch sensor, to collect the required dataset through random touching of five basic geometric shapes and one tool. We train Imagine2touch on two of those shapes and validate it on the out-of-distribution (OOD) tool. We demonstrate the efficacy of Imagine2touch through its application to the downstream task of object recognition. In this task, we evaluate Imagine2touch's performance in two experiments, together comprising five out-of-training-distribution objects. Imagine2touch achieves an object recognition accuracy of 58% after ten touches per object, surpassing a proprioception baseline.
RO | Mar 22, 2024
PseudoTouch: Efficiently Imaging the Surface Feel of Objects for Robotic Manipulation
Adrian Röfer, Nick Heppert, Abdallah Ayad et al.
Tactile sensing is vital for human dexterous manipulation; however, it has not been widely used in robotics. Compact, low-cost sensing platforms can facilitate a change, but unlike their popular optical counterparts, they are difficult to deploy in high-fidelity tasks due to their low signal dimensionality and lack of a simulation model. To overcome these challenges, we introduce PseudoTouch, which links high-dimensional structural information to low-dimensional sensor signals. It does so by learning a low-dimensional visual-tactile embedding, wherein we encode a depth patch from which we decode the tactile signal. We collect and train PseudoTouch on a dataset comprising aligned tactile and visual data pairs obtained through random touching of eight basic geometric shapes. We demonstrate the utility of our trained PseudoTouch model in two downstream tasks: object recognition and grasp stability prediction. In the object recognition task, we evaluate the learned embedding's performance on a set of five basic geometric shapes and five household objects. Using PseudoTouch, we achieve an object recognition accuracy of 84% after just ten touches, surpassing a proprioception baseline. For the grasp stability task, we use ACRONYM labels to train and evaluate a grasp success predictor using PseudoTouch's predictions derived from virtual depth information. Our approach yields a 32% absolute improvement in accuracy compared to the baseline relying on partial point cloud data. We make the data, code, and trained models publicly available at https://pseudotouch.cs.uni-freiburg.de.
RO | Jan 4
Online Estimation and Manipulation of Articulated Objects
Russell Buchanan, Adrian Röfer, João Moura et al.
From refrigerators to kitchen drawers, humans interact with articulated objects effortlessly every day while completing household chores. For automating these tasks, service robots must be capable of manipulating arbitrary articulated objects. Recent deep learning methods have been shown to predict valuable priors on the affordance of articulated objects from vision. In contrast, many other works estimate object articulations by observing the articulation motion, but this requires the robot to already be capable of manipulating the object. In this article, we propose a novel approach combining these methods by using a factor graph for online estimation of articulation which fuses learned visual priors and proprioceptive sensing during interaction into an analytical model of articulation based on Screw Theory. With our method, a robotic system makes an initial prediction of articulation from vision before touching the object, and then quickly updates the estimate from kinematic and force sensing during manipulation. We evaluate our method extensively in both simulations and real-world robotic manipulation experiments. We demonstrate several closed-loop estimation and manipulation experiments in which the robot was capable of opening previously unseen drawers. In real hardware experiments, the robot achieved a 75% success rate for autonomous opening of unknown articulated objects.
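To make the analytical articulation model concrete, the sketch below shows the two basic screw motions such a model describes: a point on a moving part rotates about a hinge axis (revolute) or slides along a direction (prismatic) as the joint value `theta` changes. It is only a forward model under assumed names (`articulate`, `rotate`, `cross`), not the factor-graph estimator described in the article.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rotate(axis, angle, p):
    """Rodrigues' formula: rotate p about a unit axis through the origin."""
    c, s = math.cos(angle), math.sin(angle)
    k_cross_p = cross(axis, p)
    k_dot_p = sum(a * b for a, b in zip(axis, p))
    return [
        p[i] * c + k_cross_p[i] * s + axis[i] * k_dot_p * (1.0 - c)
        for i in range(3)
    ]


def articulate(point, joint_type, axis, theta, origin=(0.0, 0.0, 0.0)):
    """Move `point` on an articulated part by joint value `theta`.

    axis must be a unit vector; origin is any point on a revolute axis.
    theta is in radians for revolute joints and in length units for
    prismatic joints.
    """
    if joint_type == "prismatic":
        return [point[i] + theta * axis[i] for i in range(3)]
    if joint_type == "revolute":
        rel = [point[i] - origin[i] for i in range(3)]
        rot = rotate(axis, theta, rel)
        return [rot[i] + origin[i] for i in range(3)]
    raise ValueError("unknown joint type: " + str(joint_type))
```

For example, `articulate([2, 1, 0], "revolute", [0, 0, 1], math.pi / 2, origin=(1, 1, 0))` swings a door-handle point a quarter turn about a vertical hinge located at `(1, 1, 0)`.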
RO | Jul 22, 2025
Designing for Difference: How Human Characteristics Shape Perceptions of Collaborative Robots
Sabrina Livanec, Laura Londoño, Michael Gorki et al.
The development of assistive robots for social collaboration raises critical questions about responsible and inclusive design, especially when interacting with individuals from protected groups such as those with disabilities or advanced age. Currently, research is scarce on how participants assess varying robot behaviors in combination with diverse human needs, likely since participants have limited real-world experience with advanced domestic robots. In the current study, we aim to address this gap while using methods that enable participants to assess robot behavior, as well as methods that support meaningful reflection despite limited experience. In an online study, 112 participants (from both experimental and control groups) evaluated 7 videos from a total of 28 variations of human-robot collaboration types. The experimental group first completed a cognitive-affective mapping (CAM) exercise on human-robot collaboration before providing their ratings. Although CAM reflection did not significantly affect overall ratings, it led to more pronounced assessments for certain combinations of robot behavior and human condition. Most importantly, the type of human-robot collaboration influences the assessment. Antisocial robot behavior was consistently rated as the lowest, while collaboration with aged individuals elicited more sensitive evaluations. Scenarios involving object handovers were viewed more positively than those without them. These findings suggest that both human characteristics and interaction paradigms influence the perceived acceptability of collaborative robots, underscoring the importance of prosocial design. They also highlight the potential of reflective methods, such as CAM, to elicit nuanced feedback, supporting the development of user-centered and socially responsible robotic systems tailored to diverse populations.
RO | Feb 5, 2022
Doing Right by Not Doing Wrong in Human-Robot Collaboration
Laura Londoño, Adrian Röfer, Tim Welschehold et al.
As robotic systems become more and more capable of assisting humans in their everyday lives, we must consider the opportunities for these artificial agents to make their human collaborators feel unsafe or to treat them unfairly. Robots can exhibit antisocial behavior causing physical harm to people or reproduce unfair behavior replicating and even amplifying historical and societal biases which are detrimental to humans they interact with. In this paper, we discuss these issues considering sociable robotic manipulation and fair robotic decision making. We propose a novel approach to learning fair and sociable behavior, not by reproducing positive behavior, but rather by avoiding negative behavior. In this study, we highlight the importance of incorporating sociability in robot manipulation, as well as the need to consider fairness in human-robot interactions.
RO | Nov 25, 2021
Robot Skill Adaptation via Soft Actor-Critic Gaussian Mixture Models
Iman Nematollahi, Erick Rosete-Beas, Adrian Röfer et al.
A core challenge for an autonomous agent acting in the real world is to adapt its repertoire of skills to cope with its noisy perception and dynamics. To scale learning of skills to long-horizon tasks, robots should be able to learn and later refine their skills in a structured manner through trajectories rather than making instantaneous decisions individually at each time step. To this end, we propose the Soft Actor-Critic Gaussian Mixture Model (SAC-GMM), a novel hybrid approach that learns robot skills through a dynamical system and adapts the learned skills in their own trajectory distribution space through interactions with the environment. Our approach combines classical robotics techniques of learning from demonstration with the deep reinforcement learning framework and exploits their complementary nature. We show that our method utilizes sensors solely available during the execution of preliminarily learned skills to extract relevant features that lead to faster skill refinement. Extensive evaluations in both simulation and real-world environments demonstrate the effectiveness of our method in refining robot skills by leveraging physical interactions, high-dimensional sensory data, and sparse task completion rewards. Videos, code, and pre-trained models are available at http://sac-gmm.cs.uni-freiburg.de.
RO | Dec 9, 2020
Kineverse: A Symbolic Articulation Model Framework for Model-Agnostic Mobile Manipulation
Adrian Röfer, Georg Bartels, Wolfram Burgard et al.
Service robots in the future need to execute abstract instructions such as "fetch the milk from the fridge". To translate such instructions into actionable plans, robots require in-depth background knowledge. With regard to interactions with doors and drawers, robots require articulation models that they can use for state estimation and motion planning. Existing frameworks model articulated connections as abstract concepts such as prismatic or revolute, but do not provide a parameterized model of these connections for computation. In this paper, we introduce a novel framework that uses symbolic mathematical expressions to model articulated structures -- robots and objects alike -- in a unified and extensible manner. We provide a theoretical description of this framework and the operations supported by its models, and introduce an architecture to exchange our models in robotic applications, making them as flexible as any other environmental observation. To demonstrate the utility of our approach, we employ our practical implementation Kineverse for solving common robotics tasks from state estimation and mobile manipulation, and use it further in real-world mobile robot manipulation.
RO | Jun 18, 2020
Semantic Linking Maps for Active Visual Object Search
Zhen Zeng, Adrian Röfer, Odest Chadwicke Jenkins
We aim for mobile robots to function in a variety of common human environments. Such robots need to be able to reason about the locations of previously unseen target objects. Landmark objects can help this reasoning by narrowing down the search space significantly. More specifically, we can exploit background knowledge about common spatial relations between landmark and target objects. For example, seeing a table and knowing that cups can often be found on tables aids the discovery of a cup. Such correlations can be expressed as distributions over possible pairing relationships of objects. In this paper, we propose an active visual object search strategy built on our Semantic Linking Maps (SLiM) model. SLiM simultaneously maintains the belief over a target object's location as well as landmark objects' locations, while accounting for probabilistic inter-object spatial relations. Based on SLiM, we describe a hybrid search strategy that selects the next best view pose for searching for the target object based on the maintained belief. We demonstrate the efficiency of our SLiM-based search strategy through comparative experiments in simulated environments. We further demonstrate the real-world applicability of SLiM-based search in scenarios with a Fetch mobile manipulation robot.
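The idea of letting a landmark narrow down the search can be illustrated with a toy belief update on a one-dimensional grid of cells. This is a deliberately simplified sketch, not the SLiM model: it assumes the landmark location is known exactly, and the function names and probabilities (`near_prob`, `detect_prob`, `radius`) are hypothetical values chosen for the example.

```python
def normalize(weights):
    """Scale non-negative weights so they sum to one."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("belief has no probability mass")
    return [w / total for w in weights]


def apply_landmark_relation(belief, landmark_cell, near_prob=0.8, radius=1):
    """Reweight the target belief given a landmark seen at landmark_cell.

    Cells within `radius` of the landmark are weighted by near_prob,
    all other cells by (1 - near_prob).
    """
    weighted = [
        b * (near_prob if abs(i - landmark_cell) <= radius else 1.0 - near_prob)
        for i, b in enumerate(belief)
    ]
    return normalize(weighted)


def update_no_detection(belief, viewed_cells, detect_prob=0.9):
    """Bayes update after viewing some cells without detecting the target."""
    weighted = [
        b * ((1.0 - detect_prob) if i in viewed_cells else 1.0)
        for i, b in enumerate(belief)
    ]
    return normalize(weighted)


def next_best_cell(belief):
    """Greedy view selection: the cell with the highest belief."""
    return max(range(len(belief)), key=lambda i: belief[i])


# Usage: uniform prior over five cells, a table (landmark) seen at cell 3.
belief = [0.2] * 5
belief = apply_landmark_relation(belief, landmark_cell=3)
first_view = next_best_cell(belief)
belief = update_no_detection(belief, viewed_cells={first_view})
second_view = next_best_cell(belief)
```

The greedy `next_best_cell` rule stands in for view-pose selection only loosely; the paper's hybrid strategy and its belief over landmark locations are not modeled here.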