ROMar 10, 2022
Cluttered Food Grasping with Adaptive Fingers and Synthetic-Data Trained Object DetectionAvinash Ummadisingu, Kuniyuki Takahashi, Naoki Fukaya
The food packaging industry handles an immense variety of food products with wide-ranging shapes and sizes, even within one kind of food. Menus are also diverse and change frequently, making automation of pick-and-place difficult. A popular approach to bin-picking is to first identify each piece of food in the tray by using an instance segmentation method. However, human annotations to train these methods are unreliable and error-prone since foods are packed close together with unclear boundaries and visual similarity making separation of pieces difficult. To address this problem, we propose a method that trains purely on synthetic data and successfully transfers to the real world using sim2real methods by creating datasets of filled food trays using high-quality 3d models of real pieces of food for the training instance segmentation models. Another concern is that foods are easily damaged during grasping. We address this by introducing two additional methods -- a novel adaptive finger mechanism to passively retract when a collision occurs, and a method to filter grasps that are likely to cause damage to neighbouring pieces of food during a grasp. We demonstrate the effectiveness of the proposed method on several kinds of real foods.
ROMar 28, 2024
SAID-NeRF: Segmentation-AIDed NeRF for Depth Completion of Transparent ObjectsAvinash Ummadisingu, Jongkeum Choi, Koki Yamane et al.
Acquiring accurate depth information of transparent objects using off-the-shelf RGB-D cameras is a well-known challenge in Computer Vision and Robotics. Depth estimation/completion methods are typically employed and trained on datasets with quality depth labels acquired from either simulation, additional sensors or specialized data collection setups and known 3d models. However, acquiring reliable depth information for datasets at scale is not straightforward, limiting training scalability and generalization. Neural Radiance Fields (NeRFs) are learning-free approaches and have demonstrated wide success in novel view synthesis and shape recovery. However, heuristics and controlled environments (lights, backgrounds, etc) are often required to accurately capture specular surfaces. In this paper, we propose using Visual Foundation Models (VFMs) for segmentation in a zero-shot, label-free way to guide the NeRF reconstruction process for these objects via the simultaneous reconstruction of semantic fields and extensions to increase robustness. Our proposed method Segmentation-AIDed NeRF (SAID-NeRF) shows significant performance on depth completion datasets for transparent objects and robotic grasping.
51.7CVApr 21
PLaMo 2.1-VL Technical ReportTommi Kerola, Yuya Masuda, Takashi Masuko et al.
We introduce PLaMo 2.1-VL, a lightweight Vision Language Model (VLM) for autonomous devices, available in 8B and 2B variants and designed for local and edge deployment with Japanese-language operation. Focusing on Visual Question Answering (VQA) and Visual Grounding as its core capabilities, we develop and evaluate the models for two real-world application scenarios: factory task analysis via tool recognition, and infrastructure anomaly detection. We also develop a large-scale synthetic data generation pipeline and comprehensive Japanese training and evaluation resources. PLaMo 2.1-VL outperforms comparable open models on Japanese and English benchmarks, achieving 61.5 ROUGE-L on JA-VG-VQA-500 and 85.2% accuracy on Japanese Ref-L4. For the two application scenarios, it achieves 53.9% zero-shot accuracy on factory task analysis, and fine-tuning on power plant data improves anomaly detection bbox + label F1-score from 39.7 to 64.9.
CLSep 5, 2025
PLaMo 2 Technical ReportPreferred Networks, Kaizaburo Chubachi, Yasuhiro Fujita et al.
In this report, we introduce PLaMo 2, a series of Japanese-focused large language models featuring a hybrid Samba-based architecture that transitions to full attention via continual pre-training to support 32K token contexts. Training leverages extensive synthetic corpora to overcome data scarcity, while computational efficiency is achieved through weight reuse and structured pruning. This efficient pruning methodology produces an 8B model that achieves performance comparable to our previous 100B model. Post-training further refines the models using a pipeline of supervised fine-tuning (SFT) and direct preference optimization (DPO), enhanced by synthetic Japanese instruction data and model merging techniques. Optimized for inference using vLLM and quantization with minimal accuracy loss, the PLaMo 2 models achieve state-of-the-art results on Japanese benchmarks, outperforming similarly-sized open models in instruction-following, language fluency, and Japanese-specific knowledge.
ROFeb 26, 2022
Learning-based Collision-free Planning on Arbitrary Optimization Criteria in the Latent Space through cGANsTomoki Ando, Hiroto Iino, Hiroki Mori et al.
We propose a new method for collision-free planning using Conditional Generative Adversarial Networks (cGANs) to transform between the robot's joint space and a latent space that captures only collision-free areas of the joint space, conditioned by an obstacle map. Generating multiple plausible trajectories is convenient in applications such as the manipulation of a robot arm by enabling the selection of trajectories that avoids collision with the robot or surrounding environment. In the proposed method, various trajectories that avoid obstacles can be generated by connecting the start and goal state with arbitrary line segments in this generated latent space. Our method provides this collision-free latent space, after which any planner, using any optimization conditions, can be used to generate the most suitable paths on the fly. We successfully verified this method with a simulated and actual UR5e 6-DoF robotic arm. We confirmed that different trajectories could be generated depending on optimization conditions.
ROFeb 15, 2022
Collision-free Path Planning in the Latent Space through cGANsTomoki Ando, Hiroki Mori, Ryota Torishima et al.
We show a new method for collision-free path planning by cGANs by mapping its latent space to only the collision-free areas of the robot joint space. Our method simply provides this collision-free latent space after which any planner, using any optimization conditions, can be used to generate the most suitable paths on the fly. We successfully verified this method with a simulated two-link robot arm.
ROJan 4, 2022
Target-mass Grasping of Entangled Food using Pre-grasping & Post-graspingKuniyuki Takahashi, Naoki Fukaya, Avinash Ummadisingu
Food packing industries typically use seasonal ingredients with immense variety that factory workers manually pack. For small pieces of food picked by volume or weight that tend to get entangled, stick or clump together, it is difficult to predict how intertwined they are from a visual examination, making it a challenge to grasp the requisite target mass accurately. Workers rely on a combination of weighing scales and a sequence of complex maneuvers to separate out the food and reach the target mass. This makes automation of the process a non-trivial affair. In this study, we propose methods that combines 1) pre-grasping to reduce the degree of the entanglement, 2) post-grasping to adjust the grasped mass using a novel gripper mechanism to carefully discard excess food when the grasped amount is larger than the target mass, and 3) selecting the grasping point to grasp an amount likely to be reasonably higher than target grasping mass with confidence. We evaluate the methods on a variety of foods that entangle, stick and clump, each of which has a different size, shape, and material properties such as volumetric mass density. We show significant improvement in grasp accuracy of user-specified target masses using our proposed methods.
ROMay 27, 2021
Uncertainty-Aware Self-Supervised Target-Mass Grasping of Granular FoodsKuniyuki Takahashi, Wilson Ko, Avinash Ummadisingu et al.
Food packing industry workers typically pick a target amount of food by hand from a food tray and place them in containers. Since menus are diverse and change frequently, robots must adapt and learn to handle new foods in a short time-span. Learning to grasp a specific amount of granular food requires a large training dataset, which is challenging to collect reasonably quickly. In this study, we propose ways to reduce the necessary amount of training data by augmenting a deep neural network with models that estimate its uncertainty through self-supervised learning. To further reduce human effort, we devise a data collection system that automatically generates labels. We build on the idea that we can grasp sufficiently well if there is at least one low-uncertainty (high-confidence) grasp point among the various grasp point candidates. We evaluate the methods we propose in this work on a variety of granular foods -- coffee beans, rice, oatmeal and peanuts -- each of which has a different size, shape and material properties such as volumetric mass density or friction. For these foods, we show significantly improved grasp accuracy of user-specified target masses using smaller datasets by incorporating uncertainty.
ROSep 27, 2019
Deep Gated Multi-modal Learning: In-hand Object Pose Changes Estimation using Tactile and Image DataTomoki Anzai, Kuniyuki Takahashi
For in-hand manipulation, estimation of the object pose inside the hand is one of the important functions to manipulate objects to the target pose. Since in-hand manipulation tends to cause occlusions by the hand or the object itself, image information only is not sufficient for in-hand object pose estimation. Multiple modalities can be used in this case, the advantage is that other modalities can compensate for occlusion, noise, and sensor malfunctions. Even though deciding the utilization rate of a modality (referred to as reliability value) corresponding to the situations is important, the manual design of such models is difficult, especially for various situations. In this paper, we propose deep gated multi-modal learning, which self-determines the reliability value of each modality through end-to-end deep learning. For the experiments, an RGB camera and a GelSight tactile sensor were attached to the parallel gripper of the Sawyer robot, and the object pose changes were estimated during grasping. A total of 15 objects were used in the experiments. In the proposed model, the reliability values of the modalities were determined according to the noise level and failure of each modality, and it was confirmed that the pose change was estimated even for unknown objects.
ROSep 27, 2019
Invisible Marker: Automatic Annotation of Segmentation Masks for Object ManipulationKuniyuki Takahashi, Kenta Yonekura
We propose a method to annotate segmentation masks accurately and automatically using invisible marker for object manipulation. Invisible marker is invisible under visible (regular) light conditions, but becomes visible under invisible light, such as ultraviolet (UV) light. By painting objects with the invisible marker, and by capturing images while alternately switching between regular and UV light at high speed, massive annotated datasets are created quickly and inexpensively. We show a comparison between our proposed method and manual annotations. We demonstrate semantic segmentation for deformable objects including clothes, liquids, and powders under controlled environmental light conditions. In addition, we show demonstrations of liquid pouring tasks under uncontrolled environmental light conditions in complex environments such as inside the office, house, and outdoors. Furthermore, it is possible to capture data while the camera is in motion so it becomes easier to capture large datasets, as shown in our demonstration.
ROMar 9, 2018
Deep Visuo-Tactile Learning: Estimation of Tactile Properties from ImagesKuniyuki Takahashi, Jethro Tan
Estimation of tactile properties from vision, such as slipperiness or roughness, is important to effectively interact with the environment. These tactile properties help us decide which actions we should choose and how to perform them. E.g., we can drive slower if we see that we have bad traction or grasp tighter if an item looks slippery. We believe that this ability also helps robots to enhance their understanding of the environment, and thus enables them to tailor their actions to the situation at hand. We therefore propose a model to estimate the degree of tactile properties from visual perception alone (e.g., the level of slipperiness or roughness). Our method extends a encoder-decoder network, in which the latent variables are visual and tactile features. In contrast to previous works, our method does not require manual labeling, but only RGB images and the corresponding tactile sensor data. All our data is collected with a webcam and uSkin tactile sensor mounted on the end-effector of a Sawyer robot, which strokes the surfaces of 25 different materials. We show that our model generalizes to materials not included in the training data by evaluating the feature space, indicating that it has learned to associate important tactile properties with images.
ROOct 17, 2017
Interactively Picking Real-World Objects with Unconstrained Spoken Language InstructionsJun Hatori, Yuta Kikuchi, Sosuke Kobayashi et al.
Comprehension of spoken natural language is an essential component for robots to communicate with human effectively. However, handling unconstrained spoken instructions is challenging due to (1) complex structures including a wide variety of expressions used in spoken language and (2) inherent ambiguity in interpretation of human instructions. In this paper, we propose the first comprehensive system that can handle unconstrained spoken language and is able to effectively resolve ambiguity in spoken instructions. Specifically, we integrate deep-learning-based object detection together with natural language processing technologies to handle unconstrained spoken instructions, and propose a method for robots to resolve instruction ambiguity through dialogue. Through our experiments on both a simulated environment as well as a physical industrial robot arm, we demonstrate the ability of our system to understand natural instructions from human operators effectively, and how higher success rates of the object picking task can be achieved through an interactive clarification process.
ROOct 17, 2017
Map-based Multi-Policy Reinforcement Learning: Enhancing Adaptability of Robots by Deep Reinforcement LearningAyaka Kume, Eiichi Matsumoto, Kuniyuki Takahashi et al.
In order for robots to perform mission-critical tasks, it is essential that they are able to quickly adapt to changes in their environment as well as to injuries and or other bodily changes. Deep reinforcement learning has been shown to be successful in training robot control policies for operation in complex environments. However, existing methods typically employ only a single policy. This can limit the adaptability since a large environmental modification might require a completely different behavior compared to the learning environment. To solve this problem, we propose Map-based Multi-Policy Reinforcement Learning (MMPRL), which aims to search and store multiple policies that encode different behavioral features while maximizing the expected reward in advance of the environment change. Thanks to these policies, which are stored into a multi-dimensional discrete map according to its behavioral feature, adaptation can be performed within reasonable time without retraining the robot. An appropriate pre-trained policy from the map can be recalled using Bayesian optimization. Our experiments show that MMPRL enables robots to quickly adapt to large changes without requiring any prior knowledge on the type of injuries that could occur. A highlight of the learned behaviors can be found here: https://youtu.be/QwInbilXNOE .