CVSep 21, 2023
NeuralLabeling: A versatile toolset for labeling vision datasets using Neural Radiance FieldsFloris Erich, Naoya Chiba, Yusuke Yoshiyasu et al.
We present NeuralLabeling, a labeling approach and toolset for annotating 3D scenes using either bounding boxes or meshes and generating segmentation masks, affordance maps, 2D bounding boxes, 3D bounding boxes, 6DOF object poses, depth maps, and object meshes. NeuralLabeling uses Neural Radiance Fields (NeRF) as a renderer, allowing labeling to be performed using 3D spatial tools while incorporating geometric clues such as occlusions, relying only on images captured from multiple viewpoints as input. To demonstrate the applicability of NeuralLabeling to a practical problem in robotics, we added ground truth depth maps to 30000 frames of transparent object RGB and noisy depth maps of glasses placed in a dishwasher captured using an RGBD sensor, yielding the Dishwasher30k dataset. We show that training a simple deep neural network with supervision using the annotated depth maps yields a higher reconstruction performance than training with the previously applied weakly supervised approach. We also show how instance segmentation and depth completion datasets generated using NeuralLabeling can be incorporated into a robot application for grasping transparent objects placed in a dishwasher with an accuracy of 83.3%, compared to 16.3% without depth completion.
CVAug 22, 2025Code
NeuralMeshing: Complete Object Mesh Extraction from Casual CapturesFloris Erich, Naoya Chiba, Abdullah Mustafa et al.
How can we extract complete geometric models of objects that we encounter in our daily life, without having access to commercial 3D scanners? In this paper we present an automated system for generating geometric models of objects from two or more videos. Our system requires the specification of one known point in at least one frame of each video, which can be automatically determined using a fiducial marker such as a checkerboard or Augmented Reality (AR) marker. The remaining frames are automatically positioned in world space by using Structure-from-Motion techniques. By using multiple videos and merging results, a complete object mesh can be generated, without having to rely on hole filling. Code for our system is available from https://github.com/FlorisE/NeuralMeshing.
CVJan 4, 2024
PEGASUS: Physically Enhanced Gaussian Splatting Simulation System for 6DoF Object Pose Dataset GenerationLukas Meyer, Floris Erich, Yusuke Yoshiyasu et al.
We introduce Physically Enhanced Gaussian Splatting Simulation System (PEGASUS) for 6DOF object pose dataset generation, a versatile dataset generator based on 3D Gaussian Splatting. Environment and object representations can be easily obtained using commodity cameras to reconstruct with Gaussian Splatting. <i>PEGASUS</i> allows the composition of new scenes by merging the respective underlying Gaussian Splatting point cloud of an environment with one or multiple objects. Leveraging a physics engine enables the simulation of natural object placement within a scene through interaction between meshes extracted for the objects and the environment. Consequently, an extensive amount of new scenes - static or dynamic - can be created by combining different environments and objects. By rendering scenes from various perspectives, diverse data points such as RGB images, depth maps, semantic masks, and 6DoF object poses can be extracted. Our study demonstrates that training on data generated by PEGASUS enables pose estimation networks to successfully transfer from synthetic data to real-world data. Moreover, we introduce the Ramen dataset, comprising 30 Japanese cup noodle items. This dataset includes spherical scans that captures images from both object hemisphere and the Gaussian Splatting reconstruction, making them compatible with PEGASUS.
ROMar 18, 2025
Learning Bimanual Manipulation via Action Chunking and Inter-Arm Coordination with TransformersTomohiro Motoda, Ryo Hanai, Ryoichi Nakajo et al.
Robots that can operate autonomously in a human living environment are necessary to have the ability to handle various tasks flexibly. One crucial element is coordinated bimanual movements that enable functions that are difficult to perform with one hand alone. In recent years, learning-based models that focus on the possibilities of bimanual movements have been proposed. However, the high degree of freedom of the robot makes it challenging to reason about control, and the left and right robot arms need to adjust their actions depending on the situation, making it difficult to realize more dexterous tasks. To address the issue, we focus on coordination and efficiency between both arms, particularly for synchronized actions. Therefore, we propose a novel imitation learning architecture that predicts cooperative actions. We differentiate the architecture for both arms and add an intermediate encoder layer, Inter-Arm Coordinated transformer Encoder (IACE), that facilitates synchronization and temporal alignment to ensure smooth and coordinated actions. To verify the effectiveness of our architectures, we perform distinctive bimanual tasks. The experimental results showed that our model demonstrated a high success rate for comparison and suggested a suitable architecture for the policy learning of bimanual manipulation.
RONov 15, 2017
A Systematic Literature Review of Experiments in Socially Assistive Robotics using Humanoid RobotsFloris Erich, Masakazu Hirokawa, Kenji Suzuki
We perform a Systematic Literature Review to discover how Humanoid robots are being applied in Socially Assistive Robotics experiments. Our search returned 24 papers, from which 16 were included for closer analysis. To do this analysis we used a conceptual framework inspired by Behavior-based Robotics. We were interested in finding out which robot was used (most use the robot NAO), what the goals of the application were (teaching, assisting, playing, instructing), how the robot was controlled (manually in most of the experiments), what kind of behaviors the robot exhibited (reacting to touch, pointing at body parts, singing a song, dancing, among others), what kind of actuators the robot used (always motors, sometimes speakers, hardly ever any other type of actuator) and what kind of sensors the robot used (in many studies the robot did not use any sensors at all, in others the robot frequently used camera and/or microphone). The results of this study can be used for designing software frameworks targeting Humanoid Socially Assistive Robotics, especially in the context of Software Product Line Engineering projects.