CV | Mar 29, 2022 | Code
Abstract Flow for Temporal Semantic Segmentation on the Permutohedral Lattice
Peer Schütt, Radu Alexandru Rosu, Sven Behnke
Semantic segmentation is a core ability required by autonomous agents, as being able to distinguish which parts of the scene belong to which object class is crucial for navigation and interaction with the environment. Approaches which use only one time-step of data cannot distinguish between moving objects nor can they benefit from temporal integration. In this work, we extend a backbone LatticeNet to process temporal point cloud data. Additionally, we take inspiration from optical flow methods and propose a new module called Abstract Flow which allows the network to match parts of the scene with similar abstract features and gather the information temporally. We obtain state-of-the-art results on the SemanticKITTI dataset that contains LiDAR scans from real urban environments. We share the PyTorch implementation of TemporalLatticeNet at https://github.com/AIS-Bonn/temporal_latticenet .
CV | Mar 17, 2022 | Code
MSPred: Video Prediction at Multiple Spatio-Temporal Scales with Hierarchical Recurrent Networks
Angel Villar-Corrales, Ani Karapetyan, Andreas Boltres et al.
Autonomous systems not only need to understand their current environment, but should also be able to predict future actions conditioned on past states, for instance based on captured camera frames. However, existing models mainly focus on forecasting future video frames for short time-horizons, hence being of limited use for long-term action planning. We propose Multi-Scale Hierarchical Prediction (MSPred), a novel video prediction model able to simultaneously forecast future possible outcomes of different levels of granularity at different spatio-temporal scales. By combining spatial and temporal downsampling, MSPred efficiently predicts abstract representations such as human poses or locations over long time horizons, while still maintaining a competitive performance for video frame prediction. In our experiments, we demonstrate that MSPred accurately predicts future video frames as well as high-level representations (e.g. keypoints or semantics) on bin-picking and action recognition datasets, while consistently outperforming popular approaches for future frame prediction. Furthermore, we ablate different modules and design choices in MSPred, experimentally validating that combining features of different spatial and temporal granularity leads to a superior performance. Code and models to reproduce our experiments can be found in https://github.com/AIS-Bonn/MSPred.
CV | Nov 22, 2022
PermutoSDF: Fast Multi-View Reconstruction with Implicit Surfaces using Permutohedral Lattices
Radu Alexandru Rosu, Sven Behnke
Neural radiance-density field methods have become increasingly popular for the task of novel-view rendering. Their recent extension to hash-based positional encoding ensures fast training and inference with visually pleasing results. However, density-based methods struggle with recovering accurate surface geometry. Hybrid methods alleviate this issue by optimizing the density based on an underlying SDF. However, current SDF methods are overly smooth and miss fine geometric details. In this work, we combine the strengths of these two lines of work in a novel hash-based implicit surface representation. We propose improvements to the two areas by replacing the voxel hash encoding with a permutohedral lattice which optimizes faster, especially for higher dimensions. We additionally propose a regularization scheme which is crucial for recovering high-frequency geometric detail. We evaluate our method on multiple datasets and show that we can recover geometric detail at the level of pores and wrinkles while using only RGB images for supervision. Furthermore, using sphere tracing we can render novel views at 30 fps on an RTX 3090. Code is publicly available at: https://radualexandru.github.io/permuto_sdf
CV | Jul 28, 2022
Neural Strands: Learning Hair Geometry and Appearance from Multi-View Images
Radu Alexandru Rosu, Shunsuke Saito, Ziyan Wang et al.
We present Neural Strands, a novel learning framework for modeling accurate hair geometry and appearance from multi-view image inputs. The learned hair model can be rendered in real-time from any viewpoint with high-fidelity view-dependent effects. Our model achieves intuitive shape and style control unlike volumetric counterparts. To enable these properties, we propose a novel hair representation based on a neural scalp texture that encodes the geometry and appearance of individual strands at each texel location. Furthermore, we introduce a novel neural rendering framework based on rasterization of the learned hair strands. Our neural rendering is strand-accurate and anti-aliased, making the rendering view-consistent and photorealistic. Combining appearance with a multi-view geometric prior, we enable, for the first time, the joint learning of appearance and explicit hair geometry from a multi-view setup. We demonstrate the efficacy of our approach in terms of fidelity and efficiency for various hairstyles.
CV | May 5, 2022
YOLOPose: Transformer-based Multi-Object 6D Pose Estimation using Keypoint Regression
Arash Amini, Arul Selvam Periyasamy, Sven Behnke
6D object pose estimation is a crucial prerequisite for autonomous robot manipulation applications. The state-of-the-art models for pose estimation are convolutional neural network (CNN)-based. Lately, Transformers, an architecture originally proposed for natural language processing, have been achieving state-of-the-art results in many computer vision tasks as well. Equipped with the multi-head self-attention mechanism, Transformers enable simple single-stage end-to-end architectures for learning object detection and 6D object pose estimation jointly. In this work, we propose YOLOPose (short form for You Only Look Once Pose estimation), a Transformer-based multi-object 6D pose estimation method based on keypoint regression. In contrast to the standard heatmaps for predicting keypoints in an image, we directly regress the keypoints. Additionally, we employ a learnable orientation estimation module to predict the orientation from the keypoints. Along with a separate translation estimation module, our model is end-to-end differentiable. Our method is suitable for real-time applications and achieves results comparable to state-of-the-art methods.
CV | Apr 24 | Code
Efficient Image Annotation via Semi-Supervised Object Segmentation with Label Propagation
Vitalii Tutevych, Raphael Memmesheimer, Luca Eichler et al.
Reliable object perception is necessary for general-purpose service robots. Open-vocabulary detectors struggle to generalize beyond a few classes and fully supervised training of object detectors requires time-intensive annotations. We present a semi-supervised label propagation approach for household object segmentation. A segment proposer generates class-agnostic masks, and an ensemble of Hopfield networks assigns labels by learning representative embeddings in complementary foundation model embedding spaces (CLIP, ViT, Theia). Our approach scales to 50 object classes with limited annotation overhead and can automatically label 60% of the data in a RoboCup@Home setting, where preparation time is severely constrained. Dataset and code are publicly available at https://github.com/ais-bonn/label_propagation.
CV | Oct 18, 2022
Real-Time Multi-Modal Semantic Fusion on Unmanned Aerial Vehicles with Label Propagation for Cross-Domain Adaptation
Simon Bultmann, Jan Quenzel, Sven Behnke
Unmanned aerial vehicles (UAVs) equipped with multiple complementary sensors have tremendous potential for fast autonomous or remote-controlled semantic scene analysis, e.g., for disaster examination. Here, we propose a UAV system for real-time semantic inference and fusion of multiple sensor modalities. Semantic segmentation of LiDAR scans and RGB images, as well as object detection on RGB and thermal images, run online onboard the UAV computer using lightweight CNN architectures and embedded inference accelerators. We follow a late fusion approach where semantic information from multiple sensor modalities augments 3D point clouds and image segmentation masks while also generating an allocentric semantic map. Label propagation on the semantic map allows for sensor-specific adaptation with cross-modality and cross-domain supervision. Our system provides augmented semantic images and point clouds with $\approx$ 9 Hz. We evaluate the integrated system in real-world experiments in an urban environment and at a disaster test site.
CV | Jul 21, 2023
YOLOPose V2: Understanding and Improving Transformer-based 6D Pose Estimation
Arul Selvam Periyasamy, Arash Amini, Vladimir Tsaturyan et al.
6D object pose estimation is a crucial prerequisite for autonomous robot manipulation applications. The state-of-the-art models for pose estimation are convolutional neural network (CNN)-based. Lately, Transformers, an architecture originally proposed for natural language processing, have been achieving state-of-the-art results in many computer vision tasks as well. Equipped with the multi-head self-attention mechanism, Transformers enable simple single-stage end-to-end architectures for learning object detection and 6D object pose estimation jointly. In this work, we propose YOLOPose (short form for You Only Look Once Pose estimation), a Transformer-based multi-object 6D pose estimation method based on keypoint regression, and an improved variant of the YOLOPose model. In contrast to the standard heatmaps for predicting keypoints in an image, we directly regress the keypoints. Additionally, we employ a learnable orientation estimation module to predict the orientation from the keypoints. Along with a separate translation estimation module, our model is end-to-end differentiable. Our method is suitable for real-time applications and achieves results comparable to state-of-the-art methods. We analyze the role of object queries in our architecture and reveal that the object queries specialize in detecting objects in specific image regions. Furthermore, we quantify the accuracy trade-off of using datasets of smaller sizes to train our model.
CV | Mar 17, 2022
Synthetic-to-Real Domain Adaptation using Contrastive Unpaired Translation
Benedikt T. Imbusch, Max Schwarz, Sven Behnke
The usefulness of deep learning models in robotics is largely dependent on the availability of training data. Manual annotation of training data is often infeasible. Synthetic data is a viable alternative, but suffers from domain gap. We propose a multi-step method to obtain training data without manual annotation effort: From 3D object meshes, we generate images using a modern synthesis pipeline. We utilize a state-of-the-art image-to-image translation method to adapt the synthetic images to the real domain, minimizing the domain gap in a learned manner. The translation network is trained from unpaired images, i.e. just requires an un-annotated collection of real images. The generated and refined images can then be used to train deep learning models for a particular task. We also propose and evaluate extensions to the translation method that further increase performance, such as patch-based training, which shortens training time and increases global consistency. We evaluate our method and demonstrate its effectiveness on two robotic datasets. We finally give insight into the learned refinement operations.
RO | Mar 7, 2023
External Camera-based Mobile Robot Pose Estimation for Collaborative Perception with Smart Edge Sensors
Simon Bultmann, Raphael Memmesheimer, Sven Behnke
We present an approach for estimating a mobile robot's pose w.r.t. the allocentric coordinates of a network of static cameras using multi-view RGB images. The images are processed online, locally on smart edge sensors by deep neural networks to detect the robot and estimate 2D keypoints defined at distinctive positions of the 3D robot model. Robot keypoint detections are synchronized and fused on a central backend, where the robot's pose is estimated via multi-view minimization of reprojection errors. Through the pose estimation from external cameras, the robot's localization can be initialized in an allocentric map from a completely unknown state (kidnapped robot problem) and robustly tracked over time. We conduct a series of experiments evaluating the accuracy and robustness of the camera-based pose estimation compared to the robot's internal navigation stack, showing that our camera-based method achieves pose errors below 3 cm and 1° and does not drift over time, as the robot is localized allocentrically. With the robot's pose precisely estimated, its observations can be fused into the allocentric scene model. We show a real-world application, where observations from mobile robot and static smart edge sensors are fused to collaboratively build a 3D semantic map of a $\sim$240 m$^2$ indoor environment.
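The "multi-view minimization of reprojection errors" mentioned in this abstract can be illustrated with a minimal Python sketch of the cost function alone. Everything here is an assumption for illustration: the camera dictionary layout, the function names, and the plain sum of squared pixel errors are not taken from the paper, whose actual formulation may add robust losses or detection confidences.

```python
def transform(R, t, p):
    """Map a 3D point p from world to camera coordinates: R * p + t."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def project(K, p_cam):
    """Pinhole projection; K = (fx, fy, cx, cy) is a hypothetical intrinsics tuple."""
    fx, fy, cx, cy = K
    x, y, z = p_cam
    return (fx * x / z + cx, fy * y / z + cy)


def reprojection_cost(cameras, keypoints_world, detections):
    """Sum of squared pixel errors over all cameras and keypoints.

    detections[c][k] is the 2D detection of keypoint k in camera c,
    or None if that keypoint was not detected in that view.
    """
    cost = 0.0
    for cam, dets in zip(cameras, detections):
        for p_world, det in zip(keypoints_world, dets):
            if det is None:
                continue  # skip keypoints missing in this view
            u, v = project(cam["K"], transform(cam["R"], cam["t"], p_world))
            cost += (u - det[0]) ** 2 + (v - det[1]) ** 2
    return cost


# One camera at the world origin looking along +z (identity extrinsics).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cameras = [{"K": (500.0, 500.0, 320.0, 240.0), "R": identity, "t": (0.0, 0.0, 0.0)}]
keypoints_world = [(0.1, 0.0, 2.0)]
detections = [[(348.0, 244.0)]]
cost = reprojection_cost(cameras, keypoints_world, detections)
```

In a pose estimator, `keypoints_world` would be the robot-model keypoints placed at a candidate robot pose, and an optimizer would vary that pose to reduce the cost across all views.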
RO | Dec 5, 2022
Accelerating Interactive Human-like Manipulation Learning with GPU-based Simulation and High-quality Demonstrations
Malte Mosbach, Kara Moraw, Sven Behnke
Dexterous manipulation with anthropomorphic robot hands remains a challenging problem in robotics because of the high-dimensional state and action spaces and complex contacts. Nevertheless, skillful closed-loop manipulation is required to enable humanoid robots to operate in unstructured real-world environments. Reinforcement learning (RL) has traditionally imposed enormous interaction data requirements for optimizing such complex control problems. We introduce a new framework that leverages recent advances in GPU-based simulation along with the strength of imitation learning in guiding policy search towards promising behaviors to make RL training feasible in these domains. To this end, we present an immersive virtual reality teleoperation interface designed for interactive human-like manipulation on contact-rich tasks and a suite of manipulation environments inspired by tasks of daily living. Finally, we demonstrate the complementary strengths of massively parallel RL and imitation learning, yielding robust and natural behaviors. Videos of trained policies, our source code, and the collected demonstration datasets are available at https://maltemosbach.github.io/interactive_human_like_manipulation/.
LG | May 6 | Code
Dream-MPC: Gradient-Based Model Predictive Control with Latent Imagination
Jonathan Spieler, Sven Behnke
State-of-the-art model-based Reinforcement Learning (RL) approaches either use gradient-free, population-based methods for planning, learned policy networks, or a combination of policy networks and planning. Hybrid approaches that combine Model Predictive Control (MPC) with a learned model and a policy prior to leverage the advantages of both paradigms have shown promising results. However, these approaches typically rely on gradient-free optimization methods, which can be computationally expensive for high-dimensional control tasks. While gradient-based methods are a promising alternative, recent works have empirically shown that gradient-based methods often perform worse than their gradient-free counterparts. We propose Dream-MPC, a novel approach that generates few candidate trajectories from a rolled-out policy and optimizes each trajectory by gradient ascent using a learned world model, uncertainty regularization and amortization of optimization iterations over time by reusing previously optimized actions. Our results on 24 continuous control tasks show that Dream-MPC can significantly improve the performance of the underlying policy and can outperform gradient-free MPC and state-of-the-art baselines. We will open source our code and more at https://dream-mpc.github.io.
CV | Feb 23, 2023
Object-Centric Video Prediction via Decoupling of Object Dynamics and Interactions
Angel Villar-Corrales, Ismail Wahdan, Sven Behnke
We propose a novel framework for the task of object-centric video prediction, i.e., extracting the compositional structure of a video sequence, as well as modeling object dynamics and interactions from visual observations in order to predict the future object states, from which we can then generate subsequent video frames. With the goal of learning meaningful spatio-temporal object representations and accurately forecasting object states, we propose two novel object-centric video predictor (OCVP) transformer modules, which decouple the processing of temporal dynamics and object interactions, thus improving prediction performance. In our experiments, we show how our object-centric prediction framework utilizing our OCVP predictors outperforms object-agnostic video prediction models on two different datasets, while maintaining consistent and accurate object representations.
RO | Aug 15, 2023
The $10 Million ANA Avatar XPRIZE Competition Advanced Immersive Telepresence Systems
Sven Behnke, Julie A. Adams, David Locke
The $10M ANA Avatar XPRIZE aimed to create avatar systems that can transport human presence to remote locations in real time. The participants of this multi-year competition developed robotic systems that allow operators to see, hear, and interact with a remote environment in a way that feels as if they are truly there. On the other hand, people in the remote environment were given the impression that the operator was present inside the avatar robot. At the competition finals, held in November 2022 in Long Beach, CA, USA, the avatar systems were evaluated on their support for remotely interacting with humans, exploring new environments, and employing specialized skills. This article describes the competition stages with tasks and evaluation procedures, reports the results, presents the winning teams' approaches, and discusses lessons learned.
CV | Apr 24, 2023
VR Facial Animation for Immersive Telepresence Avatars
Andre Rochow, Max Schwarz, Michael Schreiber et al.
VR Facial Animation is necessary in applications requiring a clear view of the face, even though a VR headset is worn. In our case, we aim to animate the face of an operator who is controlling our robotic avatar system. We propose a real-time capable pipeline with very fast adaptation for specific operators. In a quick enrollment step, we capture a sequence of source images from the operator without the VR headset which contain all the important operator-specific appearance information. During inference, we then use the operator keypoint information extracted from a mouth camera and two eye cameras to estimate the target expression and head pose, to which we map the appearance of a source still image. In order to enhance the mouth expression accuracy, we dynamically select an auxiliary expression frame from the captured sequence. This selection is done by learning to transform the current mouth keypoints into the source camera space, where the alignment can be determined accurately. We, furthermore, demonstrate an eye tracking pipeline that can be trained in less than a minute, a time efficient way to train the whole pipeline given a dataset that includes only complete faces, show exemplary results generated by our method, and discuss performance at the ANA Avatar XPRIZE semifinals.
CV | Sep 15, 2022
Online Marker-free Extrinsic Camera Calibration using Person Keypoint Detections
Bastian Pätzold, Simon Bultmann, Sven Behnke
Calibration of multi-camera systems, i.e. determining the relative poses between the cameras, is a prerequisite for many tasks in computer vision and robotics. Camera calibration is typically achieved using offline methods that use checkerboard calibration targets. These methods, however, often are cumbersome and lengthy, considering that a new calibration is required each time any camera pose changes. In this work, we propose a novel, marker-free online method for the extrinsic calibration of multiple smart edge sensors, relying solely on 2D human keypoint detections that are computed locally on the sensor boards from RGB camera images. Our method assumes the intrinsic camera parameters to be known and requires priming with a rough initial estimate of the camera poses. The person keypoint detections from multiple views are received at a central backend where they are synchronized, filtered, and assigned to person hypotheses. We use these person hypotheses to repeatedly solve optimization problems in the form of factor graphs. Given suitable observations of one or multiple persons traversing the scene, the estimated camera poses converge towards a coherent extrinsic calibration within a few minutes. We evaluate our approach in real-world settings and show that the calibration with our method achieves lower reprojection errors compared to a reference calibration generated by an offline method using a traditional calibration target.
CV | May 3, 2022
3D Semantic Scene Perception using Distributed Smart Edge Sensors
Simon Bultmann, Sven Behnke
We present a system for 3D semantic scene perception consisting of a network of distributed smart edge sensors. The sensor nodes are based on an embedded CNN inference accelerator and RGB-D and thermal cameras. Efficient vision CNN models for object detection, semantic segmentation, and human pose estimation run on-device in real time. 2D human keypoint estimations, augmented with the RGB-D depth estimate, as well as semantically annotated point clouds are streamed from the sensors to a central backend, where multiple viewpoints are fused into an allocentric 3D semantic scene model. As the image interpretation is computed locally, only semantic information is sent over the network. The raw images remain on the sensor boards, significantly reducing the required bandwidth, and mitigating privacy risks for the observed persons. We evaluate the proposed system in challenging real-world multi-person scenes in our lab. The proposed perception system provides a complete scene view containing semantically annotated 3D geometry and estimates 3D poses of multiple persons in real time.
RO | Nov 20, 2022
Efficient Representations of Object Geometry for Reinforcement Learning of Interactive Grasping Policies
Malte Mosbach, Sven Behnke
Grasping objects of different shapes and sizes - a foundational, effortless skill for humans - remains a challenging task in robotics. Although model-based approaches can predict stable grasp configurations for known object models, they struggle to generalize to novel objects and often operate in a non-interactive open-loop manner. In this work, we present a reinforcement learning framework that learns the interactive grasping of various geometrically distinct real-world objects by continuously controlling an anthropomorphic robotic hand. We explore several explicit representations of object geometry as input to the policy. Moreover, we propose to inform the policy implicitly through signed distances and show that this is naturally suited to guide the search through a shaped reward component. Finally, we demonstrate that the proposed framework is able to learn even in more challenging conditions, such as targeted grasping from a cluttered bin. Necessary pre-grasping behaviors such as object reorientation and utilization of environmental constraints emerge in this case. Videos of learned interactive policies are available at https://maltemosbach.github.io/geometry_aware_grasping_policies.
CV | Jan 30, 2023
Rendering the Directional TSDF for Tracking and Multi-Sensor Registration with Point-To-Plane Scale ICP
Malte Splietker, Sven Behnke
Dense real-time tracking and mapping from RGB-D images is an important tool for many robotic applications, such as navigation and manipulation. The recently presented Directional Truncated Signed Distance Function (DTSDF) is an augmentation of the regular TSDF that shows potential for more coherent maps and improved tracking performance. In this work, we present methods for rendering depth- and color images from the DTSDF, making it a true drop-in replacement for the regular TSDF in established trackers. We evaluate the algorithm on well-established datasets and observe that our method improves tracking performance and increases re-usability of mapped scenes. Furthermore, we add color integration which notably improves color-correctness at adjacent surfaces. Our novel formulation of combined ICP with frame-to-keyframe photometric error minimization further improves tracking results. Lastly, we introduce Sim3 point-to-plane ICP for refining pose priors in a multi-sensor scenario with different scale factors.
CV | Nov 21, 2022
Object-level 3D Semantic Mapping using a Network of Smart Edge Sensors
Julian Hau, Simon Bultmann, Sven Behnke
Autonomous robots that interact with their environment require a detailed semantic scene model. For this, volumetric semantic maps are frequently used. The scene understanding can further be improved by including object-level information in the map. In this work, we extend a multi-view 3D semantic mapping system consisting of a network of distributed smart edge sensors with object-level information, to enable downstream tasks that need object-level input. Objects are represented in the map via their 3D mesh model or as an object-centric volumetric sub-map that can model arbitrary object geometry when no detailed 3D model is available. We propose a keypoint-based approach to estimate object poses via PnP and refinement via ICP alignment of the 3D object model with the observed point cloud segments. Object instances are tracked to integrate observations over time and to be robust against temporary occlusions. Our method is evaluated on the public Behave dataset where it shows pose estimation accuracy within a few centimeters and in real-world experiments with the sensor network in a challenging lab environment where multiple chairs and a table are tracked through the scene online, in real time even under high occlusions.
CV | Jun 2, 2022
Predicting Physical Object Properties from Video
Martin Link, Max Schwarz, Sven Behnke
We present a novel approach to estimating physical properties of objects from video. Our approach consists of a physics engine and a correction estimator. Starting from the initial observed state, object behavior is simulated forward in time. Based on the simulated and observed behavior, the correction estimator then determines refined physical parameters for each object. The method can be iterated for increased precision. Our approach is generic, as it allows for the use of an arbitrary - not necessarily differentiable - physics engine and correction estimator. For the latter, we evaluate both gradient-free hyperparameter optimization and a deep convolutional neural network. We demonstrate faster and more robust convergence of the learned method in several simulated 2D scenarios focusing on bin situations.
CV | May 23, 2022
ConvPoseCNN2: Prediction and Refinement of Dense 6D Object Poses
Arul Selvam Periyasamy, Catherine Capellen, Max Schwarz et al.
Object pose estimation is a key perceptual capability in robotics. We propose a fully-convolutional extension of the PoseCNN method, which densely predicts object translations and orientations. This has several advantages such as improving the spatial resolution of the orientation predictions -- useful in highly-cluttered arrangements, significant reduction in parameters by avoiding full connectivity, and fast inference. We propose and discuss several aggregation methods for dense orientation predictions that can be applied as a post-processing step, such as averaging and clustering techniques. We demonstrate that our method achieves the same accuracy as PoseCNN on the challenging YCB-Video dataset and provide a detailed ablation study of several variants of our method. Finally, we demonstrate that the model can be further improved by inserting an iterative refinement module into the middle of the network, which enforces consistency of the prediction.
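As an illustration of what "averaging" dense orientation predictions can look like, here is a minimal Python sketch of a weighted quaternion mean. It is one common simple scheme (sign alignment, weighted sum, renormalization) and is only a reasonable approximation when the predictions are close together. The function name, the `(w, x, y, z)` ordering, and the per-pixel weights are assumptions, and the paper's own aggregation variants may differ.

```python
import math


def average_quaternions(quats, weights):
    """Weighted mean of unit quaternions given as (w, x, y, z) tuples.

    q and -q encode the same rotation, so each quaternion is first flipped
    into the hemisphere of the first one before summing.
    """
    ref = quats[0]
    acc = [0.0, 0.0, 0.0, 0.0]
    for q, w in zip(quats, weights):
        dot = sum(a * b for a, b in zip(q, ref))
        sign = -1.0 if dot < 0.0 else 1.0
        for i in range(4):
            acc[i] += sign * w * q[i]
    norm = math.sqrt(sum(a * a for a in acc))
    return tuple(a / norm for a in acc)


# Two per-pixel predictions: identity and a 90 degree rotation about z.
identity = (1.0, 0.0, 0.0, 0.0)
rot_z_90 = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
mean_q = average_quaternions([identity, rot_z_90], [1.0, 1.0])
```

With equal weights the result is the rotation halfway between the two inputs, here 45 degrees about z. Clustering-based aggregation, also named in the abstract, would group predictions first and average within the dominant cluster.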
RO | Jul 31, 2023
Learning Generalizable Tool Use with Non-rigid Grasp-pose Registration
Malte Mosbach, Sven Behnke
Tool use, a hallmark feature of human intelligence, remains a challenging problem in robotics due to the complex contacts and high-dimensional action space. In this work, we present a novel method to enable reinforcement learning of tool use behaviors. Our approach provides a scalable way to learn the operation of tools in a new category using only a single demonstration. To this end, we propose a new method for generalizing grasping configurations of multi-fingered robotic hands to novel objects. This is used to guide the policy search via favorable initializations and a shaped reward signal. The learned policies solve complex tool use tasks and generalize to unseen tools at test time. Visualizations and videos of the trained policies are available at https://maltemosbach.github.io/generalizable_tool_use.
CV | Nov 21, 2022
Learning Implicit Probability Distribution Functions for Symmetric Orientation Estimation from RGB Images Without Pose Labels
Arul Selvam Periyasamy, Luis Denninger, Sven Behnke
Object pose estimation is a necessary prerequisite for autonomous robotic manipulation, but the presence of symmetry increases the complexity of the pose estimation task. Existing methods for object pose estimation output a single 6D pose. Thus, they lack the ability to reason about symmetries. Lately, modeling object orientation as a non-parametric probability distribution on the SO(3) manifold by neural networks has shown impressive results. However, acquiring large-scale datasets to train pose estimation models remains a bottleneck. To address this limitation, we introduce an automatic pose labeling scheme. Given RGB-D images without object pose annotations and 3D object models, we design a two-stage pipeline consisting of point cloud registration and render-and-compare validation to generate multiple symmetrical pseudo-ground-truth pose labels for each image. Using the generated pose labels, we train an ImplicitPDF model to estimate the likelihood of an orientation hypothesis given an RGB image. An efficient hierarchical sampling of the SO(3) manifold enables tractable generation of the complete set of symmetries at multiple resolutions. During inference, the most likely orientation of the target object is estimated using gradient ascent. We evaluate the proposed automatic pose labeling scheme and the ImplicitPDF model on a photorealistic dataset and the T-Less dataset, demonstrating the advantages of the proposed method.
CV | Sep 27, 2023
Learning from SAM: Harnessing a Foundation Model for Sim2Real Adaptation by Regularization
Mayara E. Bonani, Max Schwarz, Sven Behnke
Domain adaptation is especially important for robotics applications, where target domain training data is usually scarce and annotations are costly to obtain. We present a method for self-supervised domain adaptation for the scenario where annotated source domain data (e.g. from synthetic generation) is available, but the target domain data is completely unannotated. Our method targets the semantic segmentation task and leverages a segmentation foundation model (Segment Anything Model) to obtain segment information on unannotated data. We take inspiration from recent advances in unsupervised local feature learning and propose an invariance-variance loss over the detected segments for regularizing feature representations in the target domain. Crucially, this loss structure and network architecture can handle overlapping segments and oversegmentation as produced by Segment Anything. We demonstrate the advantage of our method on the challenging YCB-Video and HomebrewedDB datasets and show that it outperforms prior work and, on YCB-Video, even a network trained with real annotations. Additionally, we provide insight through model ablations and show applicability to a custom robotic application.
RO | Apr 1
OMCL: Open-vocabulary Monte Carlo Localization
Evgenii Kruzhkov, Raphael Memmesheimer, Sven Behnke
Robust robot localization is an important prerequisite for navigation, but it becomes challenging when the map and robot measurements are obtained from different sensors. Prior methods are often tailored to specific environments, relying on closed-set semantics or fine-tuned features. In this work, we extend Monte Carlo Localization with vision-language features, allowing OMCL to robustly compute the likelihood of visual observations given a camera pose and a 3D map created from posed RGB-D images or aligned point clouds. These open-vocabulary features enable us to associate observations and map elements from different modalities, and to natively initialize global localization through natural language descriptions of nearby objects. We evaluate our approach using Matterport3D and Replica for indoor scenes and demonstrate generalization on SemanticKITTI for outdoor scenes.
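As a minimal sketch of the idea above, the particle weights of a Monte Carlo Localization filter can be updated by comparing an observed vision-language feature with the map feature expected at each particle's pose. The function names, the exponential cosine-similarity likelihood, and the `temperature` parameter are assumptions for illustration, not the OMCL implementation.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def update_weights(weights, expected_features, observed_feature, temperature=0.1):
    """Reweight particles by feature agreement (hypothetical likelihood model).

    expected_features[i] is the map feature a camera at particle i's pose
    would be expected to see; temperature is an assumed sharpness parameter.
    """
    scores = [
        w * math.exp(cosine_similarity(f, observed_feature) / temperature)
        for w, f in zip(weights, expected_features)
    ]
    total = sum(scores)
    return [s / total for s in scores]

def resample(particles, weights, rng):
    """Draw a new particle set with probability proportional to the weights."""
    return rng.choices(particles, weights=weights, k=len(particles))

# Two particles: the first one expects a feature that matches the observation.
particles = ["pose_a", "pose_b"]
new_weights = update_weights([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
resampled = resample(particles, new_weights, random.Random(0))
```

Because the comparison happens in a shared feature space, the same update could in principle be seeded from a text embedding of a language description instead of an image feature.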
CVSep 26, 2024
DiffSSC: Semantic LiDAR Scan Completion using Denoising Diffusion Probabilistic ModelsHelin Cao, Sven Behnke
Perception systems play a crucial role in autonomous driving, incorporating multiple sensors and corresponding computer vision algorithms. 3D LiDAR sensors are widely used to capture sparse point clouds of the vehicle's surroundings. However, such systems struggle to perceive occluded areas and gaps in the scene due to the sparsity of these point clouds and their lack of semantics. To address these challenges, Semantic Scene Completion (SSC) jointly predicts unobserved geometry and semantics in the scene given raw LiDAR measurements, aiming for a more complete scene representation. Building on promising results of diffusion models in image generation and super-resolution tasks, we propose their extension to SSC by implementing the noising and denoising diffusion processes in the point and semantic spaces individually. To control the generation, we employ semantic LiDAR point clouds as conditional input and design local and global regularization losses to stabilize the denoising process. We evaluate our approach on autonomous driving datasets, and it achieves state-of-the-art performance for SSC, surpassing most existing methods.
ROApr 13
Perception-aware Exploration for Consumer-grade UAVsSvetlana Seliunina, Daniel Schleich, Sven Behnke
In our work, we extend the current state-of-the-art approach for autonomous multi-UAV exploration to consumer-level UAVs, such as the DJI Mini 3 Pro. We propose a pipeline that selects viewpoint pairs from which the depth can be estimated and plans the trajectory that satisfies motion constraints necessary for odometry estimation. For the multi-UAV exploration, we propose a semi-distributed communication scheme that distributes the workload in a balanced manner. We evaluate the performance of our approach in simulation for different numbers of UAVs and demonstrate its ability to safely explore the environment and reconstruct the map even with the hardware limitations of consumer-grade UAVs.
LGMay 14
Slot-MPC: Goal-Conditioned Model Predictive Control with Object-Centric RepresentationsJonathan Spieler, Angel Villar-Corrales, Sven Behnke
Predictive world models enable agents to model scene dynamics and reason about the consequences of their actions. Inspired by human perception, object-centric world models capture scene dynamics using object-level representations, which can be used for downstream applications such as action planning. However, most object-centric world models and reinforcement learning (RL) approaches learn reactive policies that are fixed at inference time, limiting generalization to novel situations. We propose Slot-MPC, an object-centric world modeling framework that enables planning through Model Predictive Control (MPC). Slot-MPC leverages vision encoders to learn slot-based representations, which encode individual objects in the scene, and uses these structured representations to learn an action-conditioned object-centric dynamics model. At inference time, the learned dynamics model enables action planning via MPC, allowing agents to adapt to previously unseen situations. Since the learned world model is differentiable, we can use gradient-based MPC to directly optimize actions, which is computationally more efficient than relying on gradient-free, sampling-based MPC methods. Experiments on simulated robotic manipulation tasks show that Slot-MPC improves both task performance and planning efficiency compared to non-object-centric world model baselines. In the considered offline setting with limited state-action coverage, we find that gradient-based MPC performs better than gradient-free, sampling-based MPC. Our results demonstrate that explicitly structured, object-centric representations provide a strong inductive bias for controllable and generalizable decision-making. Code and additional results are available at https://slot-mpc.github.io.
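The gradient-based planning step described above can be illustrated with a toy example. In this sketch a scalar additive dynamics function stands in for the learned slot-based dynamics model, and the gradient is written out by hand instead of being obtained by automatic differentiation. All names, the cost function, and the hyperparameters are assumptions for illustration.

```python
def rollout(state, actions):
    """Toy differentiable dynamics: the state is a scalar and each action adds to it."""
    for a in actions:
        state = state + a
    return state

def plan_actions(state, goal, horizon=5, steps=200, lr=0.05, action_cost=0.01):
    """Optimize an action sequence by gradient descent on the planning cost.

    Cost: (final_state - goal)^2 + action_cost * sum(a^2).
    """
    actions = [0.0] * horizon
    for _ in range(steps):
        final = rollout(state, actions)
        # Analytic gradient of the cost with respect to each action.
        grads = [2.0 * (final - goal) + 2.0 * action_cost * a for a in actions]
        actions = [a - lr * g for a, g in zip(actions, grads)]
    return actions

def mpc_step(state, goal):
    """Receding horizon: plan a full sequence, then execute only the first action."""
    return plan_actions(state, goal)[0]

planned = plan_actions(state=0.0, goal=1.0)
final_state = rollout(0.0, planned)
first_action = mpc_step(0.0, 1.0)
```

A sampling-based planner would instead evaluate many random action sequences per step, which is the gradient-free alternative the abstract compares against.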
ROJan 11, 2022Code
Target Chase, Wall Building, and Fire Fighting: Autonomous UAVs of Team NimbRo at MBZIRC 2020Marius Beul, Max Schwarz, Jan Quenzel et al.
The Mohamed Bin Zayed International Robotics Challenge (MBZIRC) 2020 posed diverse challenges for unmanned aerial vehicles (UAVs). We present our four tailored UAVs, specifically developed for individual aerial-robot tasks of MBZIRC, including custom hardware- and software components. In Challenge 1, a target UAV is pursued using a high-efficiency, onboard object detection pipeline to capture a ball from the target UAV. A second UAV uses a similar detection method to find and pop balloons scattered throughout the arena. For Challenge 2, we demonstrate a larger UAV capable of autonomous aerial manipulation: Bricks are found and tracked from camera images. Subsequently, they are approached, picked, transported, and placed on a wall. Finally, in Challenge 3, our UAV autonomously finds fires using LiDAR and thermal cameras. It extinguishes the fires with an onboard fire extinguisher. While every robot features task-specific subsystems, all UAVs rely on a standard software stack developed for this particular and future competitions. We present our mostly open-source software solutions, including tools for system configuration, monitoring, robust wireless communication, high-level control, and agile trajectory generation. For solving the MBZIRC 2020 tasks, we advanced the state of the art in multiple research areas like machine vision and trajectory generation. We present our scientific contributions that constitute the foundation for our algorithms and systems and analyze the results from the MBZIRC competition 2020 in Abu Dhabi, where our systems reached second place in the Grand Challenge. Furthermore, we discuss lessons learned from our participation in this complex robotic challenge.
CVAug 9, 2021Code
NeuralMVS: Bridging Multi-View Stereo and Novel View SynthesisRadu Alexandru Rosu, Sven Behnke
Multi-View Stereo (MVS) is a core task in 3D computer vision. With the surge of novel deep learning methods, learned MVS has surpassed the accuracy of classical approaches, but still relies on building a memory intensive dense cost volume. Novel View Synthesis (NVS) is a parallel line of research and has recently seen an increase in popularity with Neural Radiance Field (NeRF) models, which optimize a per scene radiance field. However, NeRF methods do not generalize to novel scenes and are slow to train and test. We propose to bridge the gap between these two methodologies with a novel network that can recover 3D scene geometry as a distance function, together with high-resolution color images. Our method uses only a sparse set of images as input and can generalize well to novel scenes. Additionally, we propose a coarse-to-fine sphere tracing approach in order to significantly increase speed. We show on various datasets that our method reaches comparable accuracy to per-scene optimized methods while being able to generalize and running significantly faster. We provide the source code at https://github.com/AIS-Bonn/neural_mvs
CVJun 24, 2021Code
FaDIV-Syn: Fast Depth-Independent View Synthesis using Soft Masks and Implicit BlendingAndre Rochow, Max Schwarz, Michael Weinmann et al.
Novel view synthesis is required in many robotic applications, such as VR teleoperation and scene reconstruction. Existing methods are often too slow for these contexts, cannot handle dynamic scenes, and are limited by their explicit depth estimation stage, where incorrect depth predictions can lead to large projection errors. Our proposed method runs in real time on live streaming data and avoids explicit depth estimation by efficiently warping input images into the target frame for a range of assumed depth planes. The resulting plane sweep volume (PSV) is directly fed into our network, which first estimates soft PSV masks in a self-supervised manner, and then directly produces the novel output view. This improves efficiency and performance on transparent, reflective, thin, and feature-less scene parts. FaDIV-Syn can perform both interpolation and extrapolation tasks at 540p in real-time and outperforms state-of-the-art extrapolation methods on the large-scale RealEstate10k dataset. We thoroughly evaluate ablations, such as removing the Soft-Masking network, training from fewer examples as well as generalization to higher resolutions and stronger depth discretization. Our implementation is available.
ROOct 19, 2018Code
NimbRo-OP2X: Adult-sized Open-source 3D Printed Humanoid RobotGrzegorz Ficht, Hafez Farazi, André Brandenburger et al.
Humanoid robotics research depends on capable robot platforms, but recently developed advanced platforms are often not available to other research groups, expensive, dangerous to operate, or closed-source. The lack of available platforms forces researchers to work with smaller robots, which have less strict dynamic constraints or with simulations, which lack many real-world effects. We developed NimbRo-OP2X to address this need. At a height of 135 cm our robot is large enough to interact in a human environment. Its low weight of only 19 kg makes the operation of the robot safe and easy, as no special operational equipment is necessary. Our robot is equipped with a fast onboard computer and a GPU to accelerate parallel computations. We extend our already open-source software by a deep-learning based vision system and gait parameter optimisation. The NimbRo-OP2X was evaluated during RoboCup 2018 in Montréal, Canada, where it won all possible awards in the Humanoid AdultSize class.
ROSep 28, 2018Code
NimbRo-OP2: Grown-up 3D Printed Open Humanoid Platform for ResearchGrzegorz Ficht, Philipp Allgeuer, Hafez Farazi et al.
The versatility of humanoid robots in locomotion, full-body motion, interaction with unmodified human environments, and intuitive human-robot interaction led to increased research interest. Multiple smaller platforms are available for research, but these require a miniaturized environment to interact with, and often the small scale of the robot diminishes the influence of factors which would have affected larger robots. Unfortunately, many research platforms in the larger size range are less affordable, more difficult to operate, maintain and modify, and very often closed-source. In this work, we introduce NimbRo-OP2, an affordable, fully open-source platform in terms of both hardware and software. Being almost 135cm tall and only 18kg in weight, the robot is not only capable of interacting in an environment meant for humans, but also easy and safe to operate and does not require a gantry when doing so. The exoskeleton of the robot is 3D printed, which produces a lightweight and visually appealing design. We present all mechanical and electrical aspects of the robot, as well as some of the software features of our well-established open-source ROS software. The NimbRo-OP2 performed at RoboCup 2017 in Nagoya, Japan, where it won the Humanoid League AdultSize Soccer competition and Technical Challenge.
ROSep 28, 2018Code
First International HARTING Open Source Prize Winner: The igus Humanoid Open PlatformPhilipp Allgeuer, Grzegorz Ficht, Hafez Farazi et al.
The use of standard platforms in the field of humanoid robotics can lower the entry barrier for new research groups, and accelerate research by the facilitation of code sharing. Numerous humanoid standard platforms exist in the lower size ranges of up to 60cm, but beyond that humanoid robots scale up quickly in weight and price, becoming less affordable and more difficult to operate, maintain and modify. The igus Humanoid Open Platform is an affordable, fully open-source platform for humanoid research. At 92cm, the robot is capable of acting in an environment meant for humans, and is equipped with enough sensors, actuators and computing power to support researchers in many fields. The structure of the robot is entirely 3D printed, leading to a lightweight and visually appealing design. This paper covers the mechanical and electrical aspects of the robot, as well as the main features of the corresponding open-source ROS software. At RoboCup 2016, the platform was awarded the first International HARTING Open Source Prize.
ROSep 28, 2018Code
The igus Humanoid Open Platform: A Child-sized 3D Printed Open-Source Robot for ResearchPhilipp Allgeuer, Hafez Farazi, Grzegorz Ficht et al.
The use of standard robotic platforms can accelerate research and lower the entry barrier for new research groups. There exist many affordable humanoid standard platforms in the lower size ranges of up to 60cm, but larger humanoid robots quickly become less affordable and more difficult to operate, maintain and modify. The igus Humanoid Open Platform is a new and affordable, fully open-source humanoid platform. At 92cm in height, the robot is capable of interacting in an environment meant for humans, and is equipped with enough sensors, actuators and computing power to support researchers in many fields. The structure of the robot is entirely 3D printed, leading to a lightweight and visually appealing design. The main features of the platform are described in this article.
ROSep 27, 2018Code
Child-sized 3D Printed igus Humanoid Open PlatformPhilipp Allgeuer, Hafez Farazi, Michael Schreiber et al.
The use of standard platforms in the field of humanoid robotics can accelerate research, and lower the entry barrier for new research groups. While many affordable humanoid standard platforms exist in the lower size ranges of up to 60cm, beyond this the few available standard platforms quickly become significantly more expensive, and difficult to operate and maintain. In this paper, the igus Humanoid Open Platform is presented: a new, affordable, versatile and easily customisable standard platform for humanoid robots in the child-sized range. At 90cm, the robot is large enough to interact with a human-scale environment in a meaningful way, and is equipped with enough torque and computing power to foster research in many possible directions. The structure of the robot is entirely 3D printed, allowing for a lightweight and appealing design. The electrical and mechanical designs of the robot are presented, and the main features of the corresponding open-source ROS software are discussed. The 3D CAD files for all of the robot parts have been released open-source in conjunction with this paper.
ROSep 27, 2018Code
Robust Sensor Fusion for Robot Attitude EstimationPhilipp Allgeuer, Sven Behnke
Knowledge of how a body is oriented relative to the world is frequently invaluable information in the field of robotics. An attitude estimator that fuses 3-axis gyroscope, accelerometer and magnetometer data into a quaternion orientation estimate is presented in this paper. The concept of fused yaw, used by the estimator, is also introduced. The estimator, a nonlinear complementary filter at heart, is designed to be uniformly robust and stable, independent of the absolute orientation of the body, and has been implemented and released as a cross-platform open source C++ library. Extensions to the estimator, such as quick learning and the ability to deal dynamically with cases of reduced sensory information, are also presented.
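To illustrate the complementary-filter principle only, the sketch below shows a much simpler single-axis linear filter that blends an integrated gyroscope rate with an accelerometer-derived angle. It is not the nonlinear quaternion estimator presented in the paper, and it ignores the magnetometer and fused yaw entirely. The function name and the blending factor `alpha` are assumptions.

```python
import math

def complementary_filter_step(angle, gyro_rate, accel_y, accel_z, dt, alpha=0.98):
    """One update of a single-axis (roll) complementary filter, in radians.

    The gyro term tracks fast motion; the accelerometer term slowly
    corrects drift using the measured direction of gravity.
    """
    gyro_angle = angle + gyro_rate * dt          # integrate angular velocity
    accel_angle = math.atan2(accel_y, accel_z)   # roll implied by gravity
    return alpha * gyro_angle + (1.0 - alpha) * accel_angle

# A stationary, level sensor: a wrong initial estimate decays toward zero.
roll = 0.5
for _ in range(500):
    roll = complementary_filter_step(roll, gyro_rate=0.0, accel_y=0.0,
                                     accel_z=9.81, dt=0.01)
```

A larger `alpha` trusts the gyroscope more and makes the accelerometer correction slower.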
ROApr 12
MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAMEvgenii Kruzhkov, Sven Behnke
Feed-forward geometric foundation models can infer dense point clouds and camera motion directly from RGB streams, providing priors for monocular SLAM. However, their predictions are often view-dependent and noisy: geometry can vary across viewpoints and under image transformations, and local metric properties may drift between frames. We present MonoEM-GS, a monocular mapping pipeline that integrates such geometric predictions into a global Gaussian Splatting representation while explicitly addressing these inconsistencies. MonoEM-GS couples Gaussian Splatting with an Expectation-Maximization formulation to stabilize geometry, and employs ICP-based alignment for monocular pose estimation. Beyond geometry, MonoEM-GS parameterizes Gaussians with multi-modal features, enabling in-place open-set segmentation and other downstream queries directly on the reconstructed map. We evaluate MonoEM-GS on 7-Scenes, TUM RGB-D and Replica, and compare against recent baselines.
CVApr 15, 2024
FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance, Head-pose, and Facial Expression FeaturesAndre Rochow, Max Schwarz, Sven Behnke
The task of face reenactment is to transfer the head motion and facial expressions from a driving video to the appearance of a source image, which may be of a different person (cross-reenactment). Most existing methods are CNN-based and estimate optical flow from the source image to the current driving frame, which is then inpainted and refined to produce the output animation. We propose a transformer-based encoder for computing a set-latent representation of the source image(s). We then predict the output color of a query pixel using a transformer-based decoder, which is conditioned with keypoints and a facial expression vector extracted from the driving frame. Latent representations of the source person are learned in a self-supervised manner that factorize their appearance, head pose, and facial expressions. Thus, they are perfectly suited for cross-reenactment. In contrast to most related work, our method naturally extends to multiple source images and can thus adapt to person-specific facial dynamics. We also propose data augmentation and regularization schemes that are necessary to prevent overfitting and support generalizability of the learned representations. We evaluated our approach in a randomized user study. The results indicate superior performance compared to the state-of-the-art in terms of motion transfer quality and temporal consistency.
CVMar 13, 2024
SLCF-Net: Sequential LiDAR-Camera Fusion for Semantic Scene Completion using a 3D Recurrent U-NetHelin Cao, Sven Behnke
We introduce SLCF-Net, a novel approach for the Semantic Scene Completion (SSC) task that sequentially fuses LiDAR and camera data. It jointly estimates missing geometry and semantics in a scene from sequences of RGB images and sparse LiDAR measurements. The images are semantically segmented by a pre-trained 2D U-Net and a dense depth prior is estimated from a depth-conditioned pipeline fueled by Depth Anything. To associate the 2D image features with the 3D scene volume, we introduce Gaussian-decay Depth-prior Projection (GDP). This module projects the 2D features into the 3D volume along the line of sight with a Gaussian-decay function, centered around the depth prior. Volumetric semantics is computed by a 3D U-Net. We propagate the hidden 3D U-Net state using the sensor motion and design a novel loss to ensure temporal consistency. We evaluate our approach on the SemanticKITTI dataset and compare it with leading SSC approaches. The SLCF-Net excels in all SSC metrics and shows great temporal consistency.
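The Gaussian-decay projection described above can be sketched for a single camera ray: a 2D feature is spread over depth samples along the line of sight, weighted by a Gaussian centered on the depth prior. The function names, the `sigma` spread parameter, and the sample depths are assumptions for illustration and are not taken from the SLCF-Net implementation.

```python
import math

def gaussian_decay_weights(sample_depths, depth_prior, sigma=0.5):
    """Weight each depth sample by a Gaussian centered on the depth prior.

    sigma is a hypothetical spread, in the same unit as the depths.
    """
    return [math.exp(-0.5 * ((d - depth_prior) / sigma) ** 2) for d in sample_depths]

def project_feature_along_ray(feature, sample_depths, depth_prior, sigma=0.5):
    """Copy one 2D feature vector to every ray sample, scaled by its weight."""
    weights = gaussian_decay_weights(sample_depths, depth_prior, sigma)
    return [[w * f for f in feature] for w in weights]

depths = [1.0, 2.0, 3.0, 4.0, 5.0]
weights = gaussian_decay_weights(depths, depth_prior=3.0, sigma=1.0)
ray_features = project_feature_along_ray([2.0, 4.0], depths, depth_prior=3.0, sigma=1.0)
```

Samples near the depth prior receive most of the feature, while samples far in front of or behind it receive almost none.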
ROMar 15, 2024
Grasp Anything: Combining Teacher-Augmented Policy Gradient Learning with Instance Segmentation to Grasp Arbitrary ObjectsMalte Mosbach, Sven Behnke
Interactive grasping from clutter, akin to human dexterity, is one of the longest-standing problems in robot learning. Challenges stem from the intricacies of visual perception, the demand for precise motor skills, and the complex interplay between the two. In this work, we present Teacher-Augmented Policy Gradient (TAPG), a novel two-stage learning framework that synergizes reinforcement learning and policy distillation. After training a teacher policy to master the motor control based on object pose information, TAPG facilitates guided, yet adaptive, learning of a sensorimotor policy, based on object segmentation. We zero-shot transfer from simulation to a real robot by using Segment Anything Model for promptable object segmentation. Our trained policies adeptly grasp a wide variety of objects from cluttered scenarios in simulation and the real world based on human-understandable prompts. Furthermore, we show robust zero-shot transfer to novel objects. Videos of our experiments are available at https://maltemosbach.github.io/grasp_anything.
CVFeb 11, 2025
PlaySlot: Learning Inverse Latent Dynamics for Controllable Object-Centric Video Prediction and PlanningAngel Villar-Corrales, Sven Behnke
Predicting future scene representations is a crucial task for enabling robots to understand and interact with the environment. However, most existing methods rely on videos and simulations with precise action annotations, limiting their ability to leverage the large amount of available unlabeled video data. To address this challenge, we propose PlaySlot, an object-centric video prediction model that infers object representations and latent actions from unlabeled video sequences. It then uses these representations to forecast future object states and video frames. PlaySlot allows the generation of multiple possible futures conditioned on latent actions, which can be inferred from video dynamics, provided by a user, or generated by a learned action policy, thus enabling versatile and interpretable world modeling. Our results show that PlaySlot outperforms both stochastic and object-centric baselines for video prediction across different environments. Furthermore, we show that our inferred latent actions can be used to learn robot behaviors sample-efficiently from unlabeled video demonstrations. Videos and code are available on https://play-slot.github.io/PlaySlot/.
CVFeb 2, 2024
Spiking CenterNet: A Distillation-boosted Spiking Neural Network for Object DetectionLennard Bodden, Franziska Schwaiger, Duc Bach Ha et al.
In the era of AI at the edge, self-driving cars, and climate change, the need for energy-efficient, small, embedded AI is growing. Spiking Neural Networks (SNNs) are a promising approach to address this challenge, with their event-driven information flow and sparse activations. We propose Spiking CenterNet for object detection on event data. It combines an SNN CenterNet adaptation with an efficient M2U-Net-based decoder. Our model significantly outperforms comparable previous work on Prophesee's challenging GEN1 Automotive Detection Dataset while using less than half the energy. Distilling the knowledge of a non-spiking teacher into our SNN further increases performance. To the best of our knowledge, our work is the first approach that takes advantage of knowledge distillation in the field of spiking object detection.
ROOct 30, 2024
A Comparison of Prompt Engineering Techniques for Task Planning and Execution in Service RoboticsJonas Bode, Bastian Pätzold, Raphael Memmesheimer et al.
Recent advances in LLMs have been instrumental in autonomous robot control and human-robot interaction by leveraging their vast general knowledge and capabilities to understand and reason across a wide range of tasks and scenarios. Previous works have investigated various prompt engineering techniques for improving the performance of LLMs to accomplish tasks, while others have proposed methods that utilize LLMs to plan and execute tasks based on the available functionalities of a given robot platform. In this work, we consider both lines of research by comparing prompt engineering techniques and combinations thereof within the application of high-level task planning and execution in service robotics. We define a diverse set of tasks and a simple set of functionalities in simulation, and measure task completion accuracy and execution time for several state-of-the-art models.
LGOct 11, 2024
SOLD: Slot Object-Centric Latent Dynamics Models for Relational Manipulation Learning from PixelsMalte Mosbach, Jan Niklas Ewertz, Angel Villar-Corrales et al.
Learning a latent dynamics model provides a task-agnostic representation of an agent's understanding of its environment. Leveraging this knowledge for model-based reinforcement learning (RL) holds the potential to improve sample efficiency over model-free methods by learning from imagined rollouts. Furthermore, because the latent space serves as input to behavior models, the informative representations learned by the world model facilitate efficient learning of desired skills. Most existing methods rely on holistic representations of the environment's state. In contrast, humans reason about objects and their interactions, predicting how actions will affect specific parts of their surroundings. Inspired by this, we propose Slot-Attention for Object-centric Latent Dynamics (SOLD), a novel model-based RL algorithm that learns object-centric dynamics models in an unsupervised manner from pixel inputs. We demonstrate that the structured latent space not only improves model interpretability but also provides a valuable input space for behavior models to reason over. Our results show that SOLD outperforms DreamerV3 and TD-MPC2, two state-of-the-art model-based RL algorithms, across a range of benchmark robotic environments that require relational reasoning and manipulation capabilities. Videos are available at https://slot-latent-dynamics.github.io/.
CVJun 23, 2025
OC-SOP: Enhancing Vision-Based 3D Semantic Occupancy Prediction by Object-Centric AwarenessHelin Cao, Sven Behnke
Autonomous driving perception faces significant challenges due to occlusions and incomplete scene data in the environment. To overcome these issues, the task of semantic occupancy prediction (SOP) is proposed, which aims to jointly infer both the geometry and semantic labels of a scene from images. However, conventional camera-based methods typically treat all categories equally and primarily rely on local features, leading to suboptimal predictions, especially for dynamic foreground objects. To address this, we propose Object-Centric SOP (OC-SOP), a framework that integrates high-level object-centric cues extracted via a detection branch into the semantic occupancy prediction pipeline. This object-centric integration significantly enhances the prediction accuracy for foreground objects and achieves state-of-the-art performance among all categories on SemanticKITTI.
CVJun 23, 2025
SWA-SOP: Spatially-aware Window Attention for Semantic Occupancy Prediction in Autonomous DrivingHelin Cao, Rafael Materla, Sven Behnke
Perception systems in autonomous driving rely on sensors such as LiDAR and cameras to perceive the 3D environment. However, due to occlusions and data sparsity, these sensors often fail to capture complete information. Semantic Occupancy Prediction (SOP) addresses this challenge by inferring both occupancy and semantics of unobserved regions. Existing transformer-based SOP methods lack explicit modeling of spatial structure in attention computation, resulting in limited geometric awareness and poor performance in sparse or occluded areas. To this end, we propose Spatially-aware Window Attention (SWA), a novel mechanism that incorporates local spatial context into attention. SWA significantly improves scene completion and achieves state-of-the-art results on LiDAR-based SOP benchmarks. We further validate its generality by integrating SWA into a camera-based SOP pipeline, where it also yields consistent gains across modalities.
CVDec 15, 2023
Attention-Based VR Facial Animation with Visual Mouth Camera Guidance for Immersive Telepresence AvatarsAndre Rochow, Max Schwarz, Sven Behnke
Facial animation in virtual reality environments is essential for applications that necessitate clear visibility of the user's face and the ability to convey emotional signals. In our scenario, we animate the face of an operator who controls a robotic Avatar system. The use of facial animation is particularly valuable when the perception of interacting with a specific individual, rather than just a robot, is intended. Purely keypoint-driven animation approaches struggle with the complexity of facial movements. We present a hybrid method that uses both keypoints and direct visual guidance from a mouth camera. Our method generalizes to unseen operators and requires only a quick enrolment step with capture of two short videos. Multiple source images are selected with the intention to cover different facial expressions. Given a mouth camera frame from the HMD, we dynamically construct the target keypoints and apply an attention mechanism to determine the importance of each source image. To resolve keypoint ambiguities and animate a broader range of mouth expressions, we propose to inject visual mouth camera information into the latent space. We enable training on large-scale speaking head datasets by simulating the mouth camera input with its perspective differences and facial deformations. Our method outperforms a baseline in quality, capability, and temporal consistency. In addition, we highlight how the facial animation contributed to our victory at the ANA Avatar XPRIZE Finals.
CVApr 9, 2024
Learning Embeddings with Centroid Triplet Loss for Object Identification in Robotic GraspingAnas Gouda, Max Schwarz, Christopher Reining et al.
Foundation models are a strong trend in deep learning and computer vision. These models serve as a base for applications as they require minor or no further fine-tuning by developers to integrate into their applications. Foundation models for zero-shot object segmentation such as Segment Anything (SAM) output segmentation masks from images without any further object information. When they are followed in a pipeline by an object identification model, they can perform object detection without training. Here, we focus on training such an object identification model. A crucial practical aspect for an object identification model is to be flexible in input size. As object identification is an image retrieval problem, a suitable method should handle multi-query multi-gallery situations without constraining the number of input images (e.g. by having fixed-size aggregation layers). The key solution to train such a model is the centroid triplet loss (CTL), which aggregates image features to their centroids. CTL yields high accuracy, avoids misleading training signals and keeps the model input size flexible. In our experiments, we establish a new state of the art on the ArmBench object identification task, which shows general applicability of our model. We furthermore demonstrate an integrated unseen object detection pipeline on the challenging HOPE dataset, which requires fine-grained detection. There, our pipeline matches and surpasses related methods which have been trained on dataset-specific data.
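One plausible form of a centroid-based triplet loss is sketched below: each set of image features is averaged to a centroid, and a standard margin-based triplet loss is applied to the centroids, so the number of images per set is not fixed. The function names, the Euclidean distance, and the `margin` value are assumptions for illustration. The exact formulation used in the paper may differ.

```python
import math

def centroid(features):
    """Mean of a list of equal-length feature vectors."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid_triplet_loss(query, positives, negatives, margin=0.2):
    """Triplet loss computed on set centroids (hypothetical formulation).

    Each argument is a list of feature vectors of any length >= 1.
    """
    c_query = centroid(query)
    d_pos = euclidean(c_query, centroid(positives))
    d_neg = euclidean(c_query, centroid(negatives))
    return max(0.0, d_pos - d_neg + margin)

query = [[0.0, 0.0], [0.0, 2.0]]          # centroid is [0, 1]
same_object = [[0.0, 1.0], [0.0, 1.0]]    # centroid is [0, 1]
far_object = [[3.0, 1.0], [5.0, 1.0]]     # centroid is [4, 1]
loss_far = centroid_triplet_loss(query, same_object, far_object)
loss_near = centroid_triplet_loss(query, same_object, [[0.0, 1.0]])
```

The loss is zero when the negative centroid is farther from the query centroid than the positive centroid by at least the margin, and positive otherwise.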