Hanbo Zhang

RO
h-index17
23papers
1,389citations
Novelty50%
AI Score60

23 Papers

CVJul 5, 2023Code
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

Yan Zeng, Hanbo Zhang, Jiani Zheng et al. · bytedance

Recent advancements in Large Language Models (LLMs) such as GPT4 have displayed exceptional multi-modal capabilities in following open-ended instructions given images. However, the performance of these models heavily relies on design choices such as network structures, training data, and training strategies, and these choices have not been extensively discussed in the literature, making it difficult to quantify progress in this field. To address this issue, this paper presents a systematic and comprehensive study, quantitatively and qualitatively, on training such models. We implement over 20 variants with controlled settings. Concretely, for network structures, we compare different LLM backbones and model designs. For training data, we investigate the impact of data and sampling strategies. For instructions, we explore the influence of diversified prompts on the instruction-following ability of the trained models. For benchmarks, we contribute the first, to our best knowledge, comprehensive evaluation set including both image and video tasks through crowd-sourcing. Based on our findings, we present Lynx, which performs the most accurate multi-modal understanding while keeping the best multi-modal generation ability compared to existing open-sourced GPT4-style models.

RONov 2, 2023Code
Vision-Language Foundation Models as Effective Robot Imitators

Xinghang Li, Minghuan Liu, Hanbo Zhang et al.

Recent progress in vision language foundation models has shown their ability to understand multimodal data and resolve complicated vision language tasks, including robotics manipulation. We seek a straightforward way of making use of existing vision-language models (VLMs) with simple fine-tuning on robotics data. To this end, we derive a simple and novel vision-language manipulation framework, dubbed RoboFlamingo, built upon the open-source VLMs, OpenFlamingo. Unlike prior works, RoboFlamingo utilizes pre-trained VLMs for single-step vision-language comprehension, models sequential history information with an explicit policy head, and is slightly fine-tuned by imitation learning only on language-conditioned manipulation datasets. Such a decomposition provides RoboFlamingo the flexibility for open-loop control and deployment on low-performance platforms. By exceeding the state-of-the-art performance with a large margin on the tested benchmark, we show RoboFlamingo can be an effective and competitive alternative to adapt VLMs to robot control. Our extensive experimental results also reveal several interesting conclusions regarding the behavior of different pre-trained VLMs on manipulation tasks. We believe RoboFlamingo has the potential to be a cost-effective and easy-to-use solution for robotics manipulation, empowering everyone with the ability to fine-tune their own robotics policy.

RODec 18, 2024Code
Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models

Xinghang Li, Peiyan Li, Minghuan Liu et al.

Foundation Vision Language Models (VLMs) exhibit strong capabilities in multi-modal representation learning, comprehension, and reasoning. By injecting action components into the VLMs, Vision-Language-Action Models (VLAs) can be naturally formed and also show promising performance. Existing work has demonstrated the effectiveness and generalization of VLAs in multiple scenarios and tasks. Nevertheless, the transfer from VLMs to VLAs is not trivial since existing VLAs differ in their backbones, action-prediction formulations, data distributions, and training recipes. This leads to a missing piece for a systematic understanding of the design choices of VLAs. In this work, we disclose the key factors that significantly influence the performance of VLA and focus on answering three essential design choices: which backbone to select, how to formulate the VLA architectures, and when to add cross-embodiment data. The obtained results convince us firmly to explain why we need VLA and develop a new family of VLAs, RoboVLMs, which require very few manual designs and achieve a new state-of-the-art performance in three simulation tasks and real-world experiments. Through our extensive experiments, which include over 8 VLM backbones, 4 policy architectures, and over 600 distinct designed experiments, we provide a detailed guidebook for the future design of VLAs. In addition to the study, the highly flexible RoboVLMs framework, which supports easy integrations of new VLMs and free combinations of various design choices, is made public to facilitate future research. We open-source all details, including codes, models, datasets, and toolkits, along with detailed training and evaluation recipes at: robovlms.github.io.

90.4DCMay 22
HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs

Zewen Jin, Congkun Ai, Guangpeng Zhang et al.

Modern Mixture-of-Experts (MoE) models increasingly rely on large-scale AI accelerator clusters for efficient training. Ascend NPUs expose heterogeneous on-chip compute resources, including matrix-oriented AIC units and vector-oriented AIV units with explicit cross-queue synchronization support. However, existing training frameworks largely execute MoE operators in a serialized kernel-by-kernel manner, leaving substantial heterogeneous parallelism underutilized. This paper presents HyperParallel-MoE, a compilation and scheduling framework for MoE training on Ascend NPUs. HyperParallel-MoE transforms operator-level MoE execution into a statically scheduled tile-level heterogeneous taskflow spanning AIC and AIV resources. It introduces AIV-driven one-sided communication to eliminate host-side collective synchronization, dependency-preserving tile task generation to unify communication and computation under a common task abstraction, and event-driven static scheduling to coordinate cross-queue execution with low runtime overhead. HyperParallel-MoE further executes the compiled taskflow within a unified runtime that concurrently drives AIC and AIV workers inside a single kernel launch, enabling fine-grained overlap among communication, matrix computation, and vector computation while preserving existing optimized operators. We implement HyperParallel-MoE in the MindSpore and MindFormers stack and evaluate it using DeepSeek-style MoE models on Ascend A3 clusters. Across multiple expert-parallel configurations, HyperParallel-MoE reduces Dispatch-to-Combine MoE-FFN latency by up to 1.58x, demonstrating that tile-level heterogeneous scheduling can substantially improve MoE training efficiency on modern NPUs.

ROOct 18, 2023
InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions

Hanbo Zhang, Jie Xu, Yuchen Mo et al.

Ambiguity is ubiquitous in human communication. Previous approaches in Human-Robot Interaction (HRI) have often relied on predefined interaction templates, leading to reduced performance in realistic and open-ended scenarios. To address these issues, we present a large-scale dataset, \invig, for interactive visual grounding under language ambiguity. Our dataset comprises over 520K images accompanied by open-ended goal-oriented disambiguation dialogues, encompassing millions of object instances and corresponding question-answer pairs. Leveraging the \invig dataset, we conduct extensive studies and propose a set of baseline solutions for end-to-end interactive visual disambiguation and grounding, achieving a 45.6\% success rate during validation. To the best of our knowledge, the \invig dataset is the first large-scale dataset for resolving open-ended interactive visual grounding, presenting a practical yet highly challenging benchmark for ambiguity-aware HRI. Codes and datasets are available at: \href{https://openivg.github.io}{https://openivg.github.io}.

CVMay 4, 2025Code
RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

Shuhang Xun, Sicheng Tao, Jungang Li et al.

Multimodal Large Language Models (MLLMs) increasingly excel at perception, understanding, and reasoning. However, current benchmarks inadequately evaluate their ability to perform these tasks continuously in dynamic, real-world environments. To bridge this gap, we introduce RTV-Bench, a fine-grained benchmark for MLLM real-time video analysis. RTV-Bench uses three key principles: (1) Multi-Timestamp Question Answering (MTQA), where answers evolve with scene changes; (2) Hierarchical Question Structure, combining basic and advanced queries; and (3) Multi-dimensional Evaluation, assessing the ability of continuous perception, understanding, and reasoning. RTV-Bench contains 552 diverse videos (167.2 hours) and 4,631 high-quality QA pairs. We evaluated leading MLLMs, including proprietary (GPT-4o, Gemini 2.0), open-source offline (Qwen2.5-VL, VideoLLaMA3), and open-source real-time (VITA-1.5, InternLM-XComposer2.5-OmniLive) models. Experiment results show open-source real-time models largely outperform offline ones but still trail top proprietary models. Our analysis also reveals that larger model size or higher frame sampling rates do not significantly boost RTV-Bench performance, sometimes causing slight decreases. This underscores the need for better model architectures optimized for video stream processing and long sequences to advance real-time video analysis with MLLMs. Our benchmark toolkit is available at: https://github.com/LJungang/RTV-Bench.

ROAug 19, 2025Code
MimicFunc: Imitating Tool Manipulation from a Single Human Video via Functional Correspondence

Chao Tang, Anxing Xiao, Yuhong Deng et al.

Imitating tool manipulation from human videos offers an intuitive approach to teaching robots, while also providing a promising and scalable alternative to labor-intensive teleoperation data collection for visuomotor policy learning. While humans can mimic tool manipulation behavior by observing others perform a task just once and effortlessly transfer the skill to diverse tools for functionally equivalent tasks, current robots struggle to achieve this level of generalization. A key challenge lies in establishing function-level correspondences, considering the significant geometric variations among functionally similar tools, referred to as intra-function variations. To address this challenge, we propose MimicFunc, a framework that establishes functional correspondences with function frame, a function-centric local coordinate frame constructed with keypoint-based abstraction, for imitating tool manipulation skills. Experiments demonstrate that MimicFunc effectively enables the robot to generalize the skill from a single RGB-D human video to manipulating novel tools for functionally equivalent tasks. Furthermore, leveraging MimicFunc's one-shot generalization capability, the generated rollouts can be used to train visuomotor policies without requiring labor-intensive teleoperation data collection for novel objects. Our code and video are available at https://sites.google.com/view/mimicfunc.

LGApr 5, 2025Code
MInCo: Mitigating Information Conflicts in Distracted Visual Model-based Reinforcement Learning

Shiguang Sun, Hanbo Zhang, Zeyang Liu et al.

Existing visual model-based reinforcement learning (MBRL) algorithms with observation reconstruction often suffer from information conflicts, making it difficult to learn compact representations and hence result in less robust policies, especially in the presence of task-irrelevant visual distractions. In this paper, we first reveal that the information conflicts in current visual MBRL algorithms stem from visual representation learning and latent dynamics modeling with an information-theoretic perspective. Based on this finding, we present a new algorithm to resolve information conflicts for visual MBRL, named MInCo, which mitigates information conflicts by leveraging negative-free contrastive learning, aiding in learning invariant representation and robust policies despite noisy observations. To prevent the dominance of visual representation learning, we introduce time-varying reweighting to bias the learning towards dynamics modeling as training proceeds. We evaluate our method on several robotic control tasks with dynamic background distractions. Our experiments demonstrate that MInCo learns invariant representations against background noise and consistently outperforms current state-of-the-art visual MBRL methods. Code is available at https://github.com/ShiguangSun/minco.

ROApr 29, 2021Code
REGRAD: A Large-Scale Relational Grasp Dataset for Safe and Object-Specific Robotic Grasping in Clutter

Hanbo Zhang, Deyu Yang, Han Wang et al.

Despite the impressive progress achieved in robotic grasping, robots are not skilled in sophisticated tasks (e.g. search and grasp a specified target in clutter). Such tasks involve not only grasping but the comprehensive perception of the world (e.g. the object relationships). Recently, encouraging results demonstrate that it is possible to understand high-level concepts by learning. However, such algorithms are usually data-intensive, and the lack of data severely limits their performance. In this paper, we present a new dataset named REGRAD for the learning of relationships among objects and grasps. We collect the annotations of object poses, segmentations, grasps, and relationships for the target-driven relational grasping tasks. Our dataset is collected in both forms of 2D images and 3D point clouds. Moreover, since all the data are generated automatically, it is free to import new objects for data generation. We also released a real-world validation dataset to evaluate the sim-to-real performance of models trained on REGRAD. Finally, we conducted a series of experiments, showing that the models trained on REGRAD could generalize well to the realistic scenarios, in terms of both relationship and grasp detection. Our dataset and code could be found at: https://github.com/poisonwine/REGRAD

ROFeb 28, 2020Code
REGNet: REgion-based Grasp Network for End-to-end Grasp Detection in Point Clouds

Binglei Zhao, Hanbo Zhang, Xuguang Lan et al.

Reliable robotic grasping in unstructured environments is a crucial but challenging task. The main problem is to generate the optimal grasp of novel objects from partial noisy observations. This paper presents an end-to-end grasp detection network taking one single-view point cloud as input to tackle the problem. Our network includes three stages: Score Network (SN), Grasp Region Network (GRN), and Refine Network (RN). Specifically, SN regresses point grasp confidence and selects positive points with high confidence. Then GRN conducts grasp proposal prediction on the selected positive points. RN generates more accurate grasps by refining proposals predicted by GRN. To further improve the performance, we propose a grasp anchor mechanism, in which grasp anchors with assigned gripper orientations are introduced to generate grasp proposals. Experiments demonstrate that REGNet achieves a success rate of 79.34% and a completion rate of 96% in real-world clutter, which significantly outperforms several state-of-the-art point-cloud based methods, including GPD, PointNetGPD, and S4G. The code is available at https://github.com/zhaobinglei/REGNet_for_3D_Grasping.

16.5CLMay 5
Detecting Stealth Sycophancy in Mental-Health Dialogue with Dynamic Emotional Signature Graphs

Tianze Han, Beining Xu, Hanbo Zhang et al.

As conversational AI therapists are increasingly used in psychological support settings, reliable offline evaluation of therapeutic response quality remains an open problem. This paper studies multi-domain support-dialogue evaluation without relying on large language models as final judges. We use a direct LLM judge as a baseline that reads raw dialogue text and predicts whether the target response is harmful, productive, or neutral. We find that direct LLM judges and symmetric text-similarity metrics are poorly aligned with therapeutic quality because the target label depends on clinical direction: whether the response moves the user state toward regulation or reframing, leaves it broadly unchanged, or reinforces deterioration through higher risk affect or cognitive-distortion mass. To address this issue, we propose Dynamic Emotional Signature Graphs (DESG), a model-agnostic evaluator that represents dialogue windows with decoupled clinical states and scores them using asymmetric clinical geometry. We evaluate DESG on a constructed diagnostic stress-test benchmark of 3{,}000 dialogue windows from EmpatheticDialogues, ESConv, and CRADLE-Dialogue, covering peer support, counseling dialogue, and crisis-oriented interaction. On the 600-window held-out test aggregate, DESG-Ensemble achieves 0.9353 macro-F1, exceeding ConcatANN by 1.51 percentage points, BERTScore by 19.63 points, and TRACT by 33.81 points. Feature ablations, artifact controls, a 100-window blinded adjudicator audit, and qualitative disagreement cases indicate that the clinical state manifold is the main discriminative substrate, while graph-based trajectory components provide asymmetric scoring and interpretable diagnostics rather than serving as the sole source of performance.

63.0ROMay 4
Visibility-Aware Mobile Grasping in Dynamic Environments

Tianrun Hu, Anxing Xiao, David Hsu et al.

This paper addresses the problem of mobile grasping in dynamic, unknown environments where a robot must operate under a limited field-of-view. The fundamental challenge is the inherent trade-off between ``seeing'' around to reduce environmental uncertainty and ``moving'' the body to achieve task progress in a high-dimensional configuration space, subject to visibility constraints. Previous approaches often assume known or static environments and decouple these objectives, failing to guarantee safety when unobserved dynamic obstacles intersect the robot's path during manipulation. In this paper, we propose a unified mobile grasping system comprising two core components: (1) an iterative low-level whole-body planner coupled with velocity-aware active perception to navigate dynamic environments safely; and (2) a hierarchical high-level planner based on behavior trees that adaptively generates subgoals to guide the robot through exploration and runtime failures. We provide experimental results across 400 randomized simulation scenarios and real-world deployment on a Fetch mobile manipulator. Results show that our system achieves a success rate of 68.8\% and 58.0\% in unknown static and dynamic environments, respectively, significantly boosting success rates by 22.8\% and 18.0\% over the \nam approach in both unknown static and dynamic environments, with improved collision safety.

ROFeb 17, 2025
FUNCTO: Function-Centric One-Shot Imitation Learning for Tool Manipulation

Chao Tang, Anxing Xiao, Yuhong Deng et al.

Learning tool use from a single human demonstration video offers a highly intuitive and efficient approach to robot teaching. While humans can effortlessly generalize a demonstrated tool manipulation skill to diverse tools that support the same function (e.g., pouring with a mug versus a teapot), current one-shot imitation learning (OSIL) methods struggle to achieve this. A key challenge lies in establishing functional correspondences between demonstration and test tools, considering significant geometric variations among tools with the same function (i.e., intra-function variations). To address this challenge, we propose FUNCTO (Function-Centric OSIL for Tool Manipulation), an OSIL method that establishes function-centric correspondences with a 3D functional keypoint representation, enabling robots to generalize tool manipulation skills from a single human demonstration video to novel tools with the same function despite significant intra-function variations. With this formulation, we factorize FUNCTO into three stages: (1) functional keypoint extraction, (2) function-centric correspondence establishment, and (3) functional keypoint-based action planning. We evaluate FUNCTO against exiting modular OSIL methods and end-to-end behavioral cloning methods through real-robot experiments on diverse tool manipulation tasks. The results demonstrate the superiority of FUNCTO when generalizing to novel tools with intra-function geometric variations. More details are available at https://sites.google.com/view/functo.

ROJun 11, 2025
Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation

Wenbo Zhang, Tianrun Hu, Yanyuan Qiao et al.

We present Chain-of-Action (CoA), a novel visuo-motor policy paradigm built upon Trajectory Autoregressive Modeling. Unlike conventional approaches that predict next step action(s) forward, CoA generates an entire trajectory by explicit backward reasoning with task-specific goals through an action-level Chain-of-Thought (CoT) process. This process is unified within a single autoregressive structure: (1) the first token corresponds to a stable keyframe action that encodes the task-specific goals; and (2) subsequent action tokens are generated autoregressively, conditioned on the initial keyframe and previously predicted actions. This backward action reasoning enforces a global-to-local structure, allowing each local action to be tightly constrained by the final goal. To further realize the action reasoning structure, CoA incorporates four complementary designs: continuous action token representation; dynamic stopping for variable-length trajectory generation; reverse temporal ensemble; and multi-token prediction to balance action chunk modeling with global structure. As a result, CoA gives strong spatial generalization capabilities while preserving the flexibility and simplicity of a visuo-motor policy. Empirically, we observe CoA achieves the state-of-the-art performance across 60 RLBench tasks and 8 real-world manipulation tasks.

ROFeb 8, 2022
Robotic Grasping from Classical to Modern: A Survey

Hanbo Zhang, Jian Tang, Shiguang Sun et al.

Robotic Grasping has always been an active topic in robotics since grasping is one of the fundamental but most challenging skills of robots. It demands the coordination of robotic perception, planning, and control for robustness and intelligence. However, current solutions are still far behind humans, especially when confronting unstructured scenarios. In this paper, we survey the advances of robotic grasping, starting from the classical formulations and solutions to the modern ones. By reviewing the history of robotic grasping, we want to provide a complete view of this community, and perhaps inspire the combination and fusion of different ideas, which we think would be helpful to touch and explore the essence of robotic grasping problems. In detail, we firstly give an overview of the analytic methods for robotic grasping. After that, we provide a discussion on the recent state-of-the-art data-driven grasping approaches rising in recent years. With the development of computer vision, semantic grasping is being widely investigated and can be the basis of intelligent manipulation and skill learning for autonomous robotic systems in the future. Therefore, in our survey, we also briefly review the recent progress in this topic. Finally, we discuss the open problems and the future research directions that may be important for the human-level robustness, autonomy, and intelligence of robots.

ROSep 18, 2021
Density-based Curriculum for Multi-goal Reinforcement Learning with Sparse Rewards

Deyu Yang, Hanbo Zhang, Xuguang Lan et al.

Multi-goal reinforcement learning (RL) aims to qualify the agent to accomplish multi-goal tasks, which is of great importance in learning scalable robotic manipulation skills. However, reward engineering always requires strenuous efforts in multi-goal RL. Moreover, it will introduce inevitable bias causing the suboptimality of the final policy. The sparse reward provides a simple yet efficient way to overcome such limits. Nevertheless, it harms the exploration efficiency and even hinders the policy from convergence. In this paper, we propose a density-based curriculum learning method for efficient exploration with sparse rewards and better generalization to desired goal distribution. Intuitively, our method encourages the robot to gradually broaden the frontier of its ability along the directions to cover the entire desired goal space as much and quickly as possible. To further improve data efficiency and generality, we augment the goals and transitions within the allowed region during training. Finally, We evaluate our method on diversified variants of benchmark manipulation tasks that are challenging for existing methods. Empirical results show that our method outperforms the state-of-the-art baselines in terms of both data efficiency and success rate.

ROAug 25, 2021
INVIGORATE: Interactive Visual Grounding and Grasping in Clutter

Hanbo Zhang, Yunfan Lu, Cunjun Yu et al.

This paper presents INVIGORATE, a robot system that interacts with human through natural language and grasps a specified object in clutter. The objects may occlude, obstruct, or even stack on top of one another. INVIGORATE embodies several challenges: (i) infer the target object among other occluding objects, from input language expressions and RGB images, (ii) infer object blocking relationships (OBRs) from the images, and (iii) synthesize a multi-step plan to ask questions that disambiguate the target object and to grasp it successfully. We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping. They allow for unrestricted object categories and language expressions, subject to the training datasets. However, errors in visual perception and ambiguity in human languages are inevitable and negatively impact the robot's performance. To overcome these uncertainties, we build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules. Through approximate POMDP planning, the robot tracks the history of observations and asks disambiguation questions in order to achieve a near-optimal sequence of actions that identify and grasp the target object. INVIGORATE combines the benefits of model-based POMDP planning and data-driven deep learning. Preliminary experiments with INVIGORATE on a Fetch robot show significant benefits of this integrated approach to object grasping in clutter with natural language interactions. A demonstration video is available at https://youtu.be/zYakh80SGcU.

LGJul 29, 2019
Hindsight Trust Region Policy Optimization

Hanbo Zhang, Site Bai, Xuguang Lan et al.

Reinforcement Learning(RL) with sparse rewards is a major challenge. We propose \emph{Hindsight Trust Region Policy Optimization}(HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with \emph{hindsight} to tackle the challenge of sparse rewards. Hindsight refers to the algorithm's ability to learn from information across goals, including ones not intended for the current task. HTRPO leverages two main ideas. It introduces QKL, a quadratic approximation to the KL divergence constraint on the trust region, leading to reduced variance in KL divergence estimation and improved stability in policy update. It also presents Hindsight Goal Filtering(HGF) to select conductive hindsight goals. In experiments, we evaluate HTRPO in various sparse reward tasks, including simple benchmarks, image-based Atari games, and simulated robot control. Ablation studies indicate that QKL and HGF contribute greatly to learning stability and high performance. Comparison results show that in all tasks, HTRPO consistently outperforms both TRPO and HPG, a state-of-the-art algorithm for RL with sparse rewards.

ROSep 19, 2018
A Multi-task Convolutional Neural Network for Autonomous Robotic Grasping in Object Stacking Scenes

Hanbo Zhang, Xuguang Lan, Site Bai et al.

Autonomous robotic grasping plays an important role in intelligent robotics. However, how to help the robot grasp specific objects in object stacking scenes is still an open problem, because there are two main challenges for autonomous robots: (1)it is a comprehensive task to know what and how to grasp; (2)it is hard to deal with the situations in which the target is hidden or covered by other objects. In this paper, we propose a multi-task convolutional neural network for autonomous robotic grasping, which can help the robot find the target, make the plan for grasping and finally grasp the target step by step in object stacking scenes. We integrate vision-based robotic grasping detection and visual manipulation relationship reasoning in one single deep network and build the autonomous robotic grasping system. Experimental results demonstrate that with our model, Baxter robot can autonomously grasp the target with a success rate of 90.6%, 71.9% and 59.4% in object cluttered scenes, familiar stacking scenes and complex stacking scenes respectively.

ROSep 8, 2018
A Real-time Robotic Grasp Approach with Oriented Anchor Box

Hanbo Zhang, Xinwen Zhou, Xuguang Lan et al.

Grasp is an essential skill for robots to interact with humans and the environment. In this paper, we build a vision-based, robust and real-time robotic grasp approach with fully convolutional neural network. The main component of our approach is a grasp detection network with oriented anchor boxes as detection priors. Because the orientation of detected grasps is significant, which determines the rotation angle configuration of the gripper, we propose the Orientation Anchor Box Mechanism to regress grasp angle based on predefined assumption instead of classification or regression without any priors. With oriented anchor boxes, the grasps can be predicted more accurately and efficiently. Besides, to accelerate the network training and further improve the performance of angle regression, Angle Matching is proposed during training instead of Jaccard Index Matching. Five-fold cross validation results demonstrate that our proposed algorithm achieves an accuracy of 98.8% and 97.8% in image-wise split and object-wise split respectively, and the speed of our detection algorithm is 67 FPS with GTX 1080Ti, outperforming all the current state-of-the-art grasp detection algorithms on Cornell Dataset both in speed and accuracy. Robotic experiments demonstrate the robustness and generalization ability in unseen objects and real-world environment, with the average success rate of 90.0% and 84.2% of familiar things and unseen things respectively on Baxter robot platform.

ROAug 30, 2018
ROI-based Robotic Grasp Detection for Object Overlapping Scenes

Hanbo Zhang, Xuguang Lan, Site Bai et al.

Grasp detection with consideration of the affiliations between grasps and their owner in object overlapping scenes is a necessary and challenging task for the practical use of the robotic grasping approach. In this paper, a robotic grasp detection algorithm named ROI-GD is proposed to provide a feasible solution to this problem based on Region of Interest (ROI), which is the region proposal for objects. ROI-GD uses features from ROIs to detect grasps instead of the whole scene. It has two stages: the first stage is to provide ROIs in the input image and the second-stage is the grasp detector based on ROI features. We also contribute a multi-object grasp dataset, which is much larger than Cornell Grasp Dataset, by labeling Visual Manipulation Relationship Dataset. Experimental results demonstrate that ROI-GD performs much better in object overlapping scenes and at the meantime, remains comparable with state-of-the-art grasp detection algorithms on Cornell Grasp Dataset and Jacquard Dataset. Robotic experiments demonstrate that ROI-GD can help robots grasp the target in single-object and multi-object scenes with the overall success rates of 92.5% and 83.8% respectively.

ROMar 6, 2018
Fully Convolutional Grasp Detection Network with Oriented Anchor Box

Xinwen Zhou, Xuguang Lan, Hanbo Zhang et al.

In this paper, we present a real-time approach to predict multiple grasping poses for a parallel-plate robotic gripper using RGB images. A model with oriented anchor box mechanism is proposed and a new matching strategy is used during the training process. An end-to-end fully convolutional neural network is employed in our work. The network consists of two parts: the feature extractor and multi-grasp predictor. The feature extractor is a deep convolutional neural network. The multi-grasp predictor regresses grasp rectangles from predefined oriented rectangles, called oriented anchor boxes, and classifies the rectangles into graspable and ungraspable. On the standard Cornell Grasp Dataset, our model achieves an accuracy of 97.74% and 96.61% on image-wise split and object-wise split respectively, and outperforms the latest state-of-the-art approach by 1.74% on image-wise split and 0.51% on object-wise split.

ROFeb 24, 2018
Visual Manipulation Relationship Network

Hanbo Zhang, Xuguang Lan, Xinwen Zhou et al.

Robotic grasping detection is one of the most important fields in robotics, in which great progress has been made recent years with the help of convolutional neural network (CNN). However, including multiple objects in one scene can invalidate the existing CNN-based grasping detection algorithms, because manipulation relationships among objects are not considered, which are required to guide the robot to grasp things in the right order. This paper presents a new CNN architecture called Visual Manipulation Relationship Network (VMRN) to help robot detect targets and predict the manipulation relationships in real time. To implement end-to-end training and meet real-time requirements in robot tasks, we propose the Object Pairing Pooling Layer (OP2L) to help to predict all manipulation relationships in one forward process. Moreover, in order to train VMRN, we collect a dataset named Visual Manipulation Relationship Dataset (VMRD) consisting of 5185 images with more than 17000 object instances and the manipulation relationships between all possible pairs of objects in every image, which is labeled by the manipulation relationship tree. The experimental results show that the new network architecture can detect objects and predict manipulation relationships simultaneously and meet the real-time requirements in robot tasks.