Mingshan Sun

CV
h-index5
6papers
49citations
Novelty59%
AI Score39

6 Papers

CVAug 15, 2022
Uni6Dv2: Noise Elimination for 6D Pose Estimation

Mingshan Sun, Ye Zheng, Tianpeng Bao et al.

Uni6D is the first 6D pose estimation approach to employ a unified backbone network to extract features from both RGB and depth images. We discover that the principal reasons of Uni6D performance limitations are Instance-Outside and Instance-Inside noise. Uni6D's simple pipeline design inherently introduces Instance-Outside noise from background pixels in the receptive field, while ignoring Instance-Inside noise in the input depth data. In this paper, we propose a two-step denoising approach for dealing with the aforementioned noise in Uni6D. To reduce noise from non-instance regions, an instance segmentation network is utilized in the first step to crop and mask the instance. A lightweight depth denoising module is proposed in the second step to calibrate the depth feature before feeding it into the pose regression network. Extensive experiments show that our Uni6Dv2 reliably and robustly eliminates noise, outperforming Uni6D without sacrificing too much inference efficiency. It also reduces the need for annotated real data that requires costly labeling.

CVMar 15, 2023
SeqCo-DETR: Sequence Consistency Training for Self-Supervised Object Detection with Transformers

Guoqiang Jin, Fan Yang, Mingshan Sun et al.

Self-supervised pre-training and transformer-based networks have significantly improved the performance of object detection. However, most of the current self-supervised object detection methods are built on convolutional-based architectures. We believe that the transformers' sequence characteristics should be considered when designing a transformer-based self-supervised method for the object detection task. To this end, we propose SeqCo-DETR, a novel Sequence Consistency-based self-supervised method for object DEtection with TRansformers. SeqCo-DETR defines a simple but effective pretext by minimizes the discrepancy of the output sequences of transformers with different image views as input and leverages bipartite matching to find the most relevant sequence pairs to improve the sequence-level self-supervised representation learning performance. Furthermore, we provide a mask-based augmentation strategy incorporated with the sequence consistency strategy to extract more representative contextual information about the object for the object detection task. Our method achieves state-of-the-art results on MS COCO (45.8 AP) and PASCAL VOC (64.1 AP), demonstrating the effectiveness of our approach.

CVOct 20, 2022
Geo6D: Geometric Constraints Learning for 6D Pose Estimation

Jianqiu Chen, Mingshan Sun, Ye Zheng et al.

Numerous 6D pose estimation methods have been proposed that employ end-to-end regression to directly estimate the target pose parameters. Since the visible features of objects are implicitly influenced by their poses, the network allows inferring the pose by analyzing the differences in features in the visible region. However, due to the unpredictable and unrestricted range of pose variations, the implicitly learned visible feature-pose constraints are insufficiently covered by the training samples, making the network vulnerable to unseen object poses. To tackle these challenges, we proposed a novel geometric constraints learning approach called Geo6D for direct regression 6D pose estimation methods. It introduces a pose transformation formula expressed in relative offset representation, which is leveraged as geometric constraints to reconstruct the input and output targets of the network. These reconstructed data enable the network to estimate the pose based on explicit geometric constraints and relative offset representation mitigates the issue of the pose distribution gap. Extensive experimental results show that when equipped with Geo6D, the direct 6D methods achieve state-of-the-art performance on multiple datasets and demonstrate significant effectiveness, even with only 10% amount of data.

RODec 2, 2025
SAM2Grasp: Resolve Multi-modal Grasping via Prompt-conditioned Temporal Action Prediction

Shengkai Wu, Jinrong Yang, Wenqiu Luo et al.

Imitation learning for robotic grasping is often plagued by the multimodal problem: when a scene contains multiple valid targets, demonstrations of grasping different objects create conflicting training signals. Standard imitation learning policies fail by averaging these distinct actions into a single, invalid action. In this paper, we introduce SAM2Grasp, a novel framework that resolves this issue by reformulating the task as a uni-modal, prompt-conditioned prediction problem. Our method leverages the frozen SAM2 model to use its powerful visual temporal tracking capability and introduces a lightweight, trainable action head that operates in parallel with its native segmentation head. This design allows for training only the small action head on pre-computed temporal-visual features from SAM2. During inference, an initial prompt, such as a bounding box provided by an upstream object detection model, designates the specific object to be grasped. This prompt conditions the action head to predict a unique, unambiguous grasp trajectory for that object alone. In all subsequent video frames, SAM2's built-in temporal tracking capability automatically maintains stable tracking of the selected object, enabling our model to continuously predict the grasp trajectory from the video stream without further external guidance. This temporal-prompted approach effectively eliminates ambiguity from the visuomotor policy. We demonstrate through extensive experiments that SAM2Grasp achieves state-of-the-art performance in cluttered, multi-object grasping tasks.

ROMay 23, 2025
Bootstrapping Imitation Learning for Long-horizon Manipulation via Hierarchical Data Collection Space

Jinrong Yang, Kexun Chen, Zhuoling Li et al.

Imitation learning (IL) with human demonstrations is a promising method for robotic manipulation tasks. While minimal demonstrations enable robotic action execution, achieving high success rates and generalization requires high cost, e.g., continuously adding data or incrementally conducting human-in-loop processes with complex hardware/software systems. In this paper, we rethink the state/action space of the data collection pipeline as well as the underlying factors responsible for the prediction of non-robust actions. To this end, we introduce a Hierarchical Data Collection Space (HD-Space) for robotic imitation learning, a simple data collection scheme, endowing the model to train with proactive and high-quality data. Specifically, We segment the fine manipulation task into multiple key atomic tasks from a high-level perspective and design atomic state/action spaces for human demonstrations, aiming to generate robust IL data. We conduct empirical evaluations across two simulated and five real-world long-horizon manipulation tasks and demonstrate that IL policy training with HD-Space-based data can achieve significantly enhanced policy performance. HD-Space allows the use of a small amount of demonstration data to train a more powerful policy, particularly for long-horizon manipulation tasks. We aim for HD-Space to offer insights into optimizing data quality and guiding data scaling. project page: https://hd-space-robotics.github.io.

CVMay 29, 2023
ZeroPose: CAD-Prompted Zero-shot Object 6D Pose Estimation in Cluttered Scenes

Jianqiu Chen, Zikun Zhou, Mingshan Sun et al.

Many robotics and industry applications have a high demand for the capability to estimate the 6D pose of novel objects from the cluttered scene. However, existing classic pose estimation methods are object-specific, which can only handle the specific objects seen during training. When applied to a novel object, these methods necessitate a cumbersome onboarding process, which involves extensive dataset preparation and model retraining. The extensive duration and resource consumption of onboarding limit their practicality in real-world applications. In this paper, we introduce ZeroPose, a novel zero-shot framework that performs pose estimation following a Discovery-Orientation-Registration (DOR) inference pipeline. This framework generalizes to novel objects without requiring model retraining. Given the CAD model of a novel object, ZeroPose enables in seconds onboarding time to extract visual and geometric embeddings from the CAD model as a prompt. With the prompting of the above embeddings, DOR can discover all related instances and estimate their 6D poses without additional human interaction or presupposing scene conditions. Compared with existing zero-shot methods solved by the render-and-compare paradigm, the DOR pipeline formulates the object pose estimation into a feature-matching problem, which avoids time-consuming online rendering and improves efficiency. Experimental results on the seven datasets show that ZeroPose as a zero-shot method achieves comparable performance with object-specific training methods and outperforms the state-of-the-art zero-shot method with 50x inference speed improvement.