ROMar 29, 2023
Learning Complicated Manipulation Skills via Deterministic Policy with Limited DemonstrationsLiu Haofeng, Chen Yiwen, Tan Jiayi et al.
Combined with demonstrations, deep reinforcement learning can efficiently develop policies for manipulators. However, it takes time to collect sufficient high-quality demonstrations in practice. And human demonstrations may be unsuitable for robots. The non-Markovian process and over-reliance on demonstrations are further challenges. For example, we found that RL agents are sensitive to demonstration quality in manipulation tasks and struggle to adapt to demonstrations directly from humans. Thus it is challenging to leverage low-quality and insufficient demonstrations to assist reinforcement learning in training better policies, and sometimes, limited demonstrations even lead to worse performance. We propose a new algorithm named TD3fG (TD3 learning from a generator) to solve these problems. It forms a smooth transition from learning from experts to learning from experience. This innovation can help agents extract prior knowledge while reducing the detrimental effects of the demonstrations. Our algorithm performs well in Adroit manipulator and MuJoCo tasks with limited demonstrations.
CVMar 30, 2022Code
PEGG-Net: Pixel-Wise Efficient Grasp Generation in Complex ScenesHaozhe Wang, Zhiyang Liu, Lei Zhou et al.
Vision-based grasp estimation is an essential part of robotic manipulation tasks in the real world. Existing planar grasp estimation algorithms have been demonstrated to work well in relatively simple scenes. But when it comes to complex scenes, such as cluttered scenes with messy backgrounds and moving objects, the algorithms from previous works are prone to generate inaccurate and unstable grasping contact points. In this work, we first study the existing planar grasp estimation algorithms and analyze the related challenges in complex scenes. Secondly, we design a Pixel-wise Efficient Grasp Generation Network (PEGG-Net) to tackle the problem of grasping in complex scenes. PEGG-Net can achieve improved state-of-the-art performance on the Cornell dataset (98.9%) and second-best performance on the Jacquard dataset (93.8%), outperforming other existing algorithms without the introduction of complex structures. Thirdly, PEGG-Net could operate in a closed-loop manner for added robustness in dynamic environments using position-based visual servoing (PBVS). Finally, we conduct real-world experiments on static, dynamic, and cluttered objects in different complex scenes. The results show that our proposed network achieves a high success rate in grasping irregular objects, household objects, and workshop tools. To benefit the community, our trained model and supplementary materials are available at https://github.com/HZWang96/PEGG-Net.
CVAug 23, 2021Code
Voxel-based Network for Shape Completion by Leveraging Edge GenerationXiaogang Wang, Marcelo H Ang, Gim Hee Lee
Deep learning technique has yielded significant improvements in point cloud completion with the aim of completing missing object shapes from partial inputs. However, most existing methods fail to recover realistic structures due to over-smoothing of fine-grained details. In this paper, we develop a voxel-based network for point cloud completion by leveraging edge generation (VE-PCN). We first embed point clouds into regular voxel grids, and then generate complete objects with the help of the hallucinated shape edges. This decoupled architecture together with a multi-scale grid feature learning is able to generate more realistic on-surface details. We evaluate our model on the publicly available completion datasets and show that it outperforms existing state-of-the-art approaches quantitatively and qualitatively. Our source code is available at https://github.com/xiaogangw/VE-PCN.
CVAug 2, 2020Code
Point Cloud Completion by Learning Shape PriorsXiaogang Wang, Marcelo H Ang, Gim Hee Lee
In view of the difficulty in reconstructing object details in point cloud completion, we propose a shape prior learning method for object completion. The shape priors include geometric information in both complete and the partial point clouds. We design a feature alignment strategy to learn the shape prior from complete points, and a coarse to fine strategy to incorporate partial prior in the fine stage. To learn the complete objects prior, we first train a point cloud auto-encoder to extract the latent embeddings from complete points. Then we learn a mapping to transfer the point features from partial points to that of the complete points by optimizing feature alignment losses. The feature alignment losses consist of a L2 distance and an adversarial loss obtained by Maximum Mean Discrepancy Generative Adversarial Network (MMD-GAN). The L2 distance optimizes the partial features towards the complete ones in the feature space, and MMD-GAN decreases the statistical distance of two point features in a Reproducing Kernel Hilbert Space. We achieve state-of-the-art performances on the point cloud completion task. Our code is available at https://github.com/xiaogangw/point-cloud-completion-shape-prior.
CVJul 16, 2020Code
Shape Prior Deformation for Categorical 6D Object Pose and Size EstimationMeng Tian, Marcelo H Ang, Gim Hee Lee
We present a novel learning approach to recover the 6D poses and sizes of unseen object instances from an RGB-D image. To handle the intra-class shape variation, we propose a deep network to reconstruct the 3D object model by explicitly modeling the deformation from a pre-learned categorical shape prior. Additionally, our network infers the dense correspondences between the depth observation of the object instance and the reconstructed 3D model to jointly estimate the 6D object pose and size. We design an autoencoder that trains on a collection of object models and compute the mean latent embedding for each category to learn the categorical shape priors. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach significantly outperforms the state of the art. Our code is available at https://github.com/mentian/object-deformnet.
CVApr 7, 2020Code
Cascaded Refinement Network for Point Cloud CompletionXiaogang Wang, Marcelo H Ang, Gim Hee Lee
Point clouds are often sparse and incomplete. Existing shape completion methods are incapable of generating details of objects or learning the complex point distributions. To this end, we propose a cascaded refinement network together with a coarse-to-fine strategy to synthesize the detailed object shapes. Considering the local details of partial input with the global shape information together, we can preserve the existing details in the incomplete point set and generate the missing parts with high fidelity. We also design a patch discriminator that guarantees every local area has the same pattern with the ground truth to learn the complicated point distribution. Quantitative and qualitative experiments on different datasets show that our method achieves superior results compared to existing state-of-the-art approaches on the 3D point cloud completion task. Our source code is available at https://github.com/xiaogangw/cascaded-point-completion.git.
CVFeb 29, 2020Code
Robust 6D Object Pose Estimation by Learning RGB-D FeaturesMeng Tian, Liang Pan, Marcelo H Ang et al.
Accurate 6D object pose estimation is fundamental to robotic manipulation and grasping. Previous methods follow a local optimization approach which minimizes the distance between closest point pairs to handle the rotation ambiguity of symmetric objects. In this work, we propose a novel discrete-continuous formulation for rotation regression to resolve this local-optimum problem. We uniformly sample rotation anchors in SO(3), and predict a constrained deviation from each anchor to the target, as well as uncertainty scores for selecting the best prediction. Additionally, the object location is detected by aggregating point-wise vectors pointing to the 3D center. Experiments on two benchmarks: LINEMOD and YCB-Video, show that the proposed method outperforms state-of-the-art approaches. Our code is available at https://github.com/mentian/object-posenet.
14.0CVApr 30
3D Reconstruction Techniques in the Manufacturing Domain: Applications, Research Opportunities and Use CasesChialoon Cheng, Kaijun liu, Zhiyang Liu et al.
This comprehensive review examines the evolution and the current state of the art in three-dimensional (3D) reconstruction techniques in manufacturing applications. The analysis covers both traditional approaches and emerging deep learning methods, showing a critical research gap in unified 3d reconstruction frameworks. Through systematic review of 106 recent publications, we classify reconstruction techniques into three primary categories: data acquisition, point cloud generation, post-processing and applications. Non-contact methods, particularly structured light scanning and stereo vision, have shown significant adoption in manufacturing, with 47% of surveyed applications focusing on quality inspection. The integration of deep learning has enhanced reconstruction accuracy and processing speed, particularly in feature extraction and matching. Key applications span design and development (13%), machining (8%), process (17%), assembly (22%), and quality inspection (40%). While current technologies achieve sub-millimeter accuracy in controlled environments, challenges persist in handling reflective surfaces and dynamic environments. Our findings indicate a trend toward hybrid systems combining multiple sensor types and processing methods to overcome individual limitations. This survey provides a structured framework for understanding current capabilities and future directions in manufacturing-focused 3D reconstruction.
CVMar 7
Perception-Aware Multimodal Spatial Reasoning from Monocular ImagesYanchun Cheng, Rundong Wang, Xulei Yang et al.
Spatial reasoning from monocular images is essential for autonomous driving, yet current Vision-Language Models (VLMs) still struggle with fine-grained geometric perception, particularly under large scale variation and ambiguous object appearance. We propose a simple yet effective perception-aware multimodal reasoning framework that equips VLMs with explicit object-centric grounding ability. Instead of relying on textual bounding-box outputs, each referred object is represented using all Visual Reference Tokens (VRTs) within its spatial extent, enabling visual evidence and textual reasoning to be processed jointly in a unified token space. To further strengthen cross-modal interaction, we construct a Multimodal Chain-of-Thought (MM-CoT) dataset that injects aligned visual and textual reasoning signals. A deterministic ordering strategy is introduced to make supervision over inherently unordered VRT sets fully compatible with the VLM's autoregressive next-token prediction. With only standard supervised fine-tuning, our method achieves substantial improvements on the SURDS benchmark, outperforming previous approaches - including those using RL-based post-training - by a large margin across both single-object and multi-object tasks. These results demonstrate that accurate perception and multimodal reasoning are mutually reinforcing, and together form the key to robust spatial understanding in challenging monocular driving scenarios.
AINov 16, 2021
Improving Learning from Demonstrations by Learning from ExperienceHaofeng Liu, Yiwen Chen, Jiayi Tan et al.
How to make imitation learning more general when demonstrations are relatively limited has been a persistent problem in reinforcement learning (RL). Poor demonstrations lead to narrow and biased date distribution, non-Markovian human expert demonstration makes it difficult for the agent to learn, and over-reliance on sub-optimal trajectories can make it hard for the agent to improve its performance. To solve these problems we propose a new algorithm named TD3fG that can smoothly transition from learning from experts to learning from experience. Our algorithm achieves good performance in the MUJOCO environment with limited and sub-optimal demonstrations. We use behavior cloning to train the network as a reference action generator and utilize it in terms of both loss function and exploration noise. This innovation can help agents extract a priori knowledge from demonstrations while reducing the detrimental effects of the poor Markovian properties of the demonstrations. It has a better performance compared to the BC+ fine-tuning and DDPGfD approach, especially when the demonstrations are relatively limited. We call our method TD3fG meaning TD3 from a generator.
CVOct 17, 2020
Cascaded Refinement Network for Point Cloud Completion with Self-supervisionXiaogang Wang, Marcelo H Ang, Gim Hee Lee
Point clouds are often sparse and incomplete, which imposes difficulties for real-world applications. Existing shape completion methods tend to generate rough shapes without fine-grained details. Considering this, we introduce a two-branch network for shape completion. The first branch is a cascaded shape completion sub-network to synthesize complete objects, where we propose to use the partial input together with the coarse output to preserve the object details during the dense point reconstruction. The second branch is an auto-encoder to reconstruct the original partial input. The two branches share a same feature extractor to learn an accurate global feature for shape completion. Furthermore, we propose two strategies to enable the training of our network when ground truth data are not available. This is to mitigate the dependence of existing approaches on large amounts of ground truth training data that are often difficult to obtain in real-world applications. Additionally, our proposed strategies are also able to improve the reconstruction quality for fully supervised learning. We verify our approach in self-supervised, semi-supervised and fully supervised settings with superior performances. Quantitative and qualitative results on different datasets demonstrate that our method achieves more realistic outputs than state-of-the-art approaches on the point cloud completion task.