CVMar 19, 2022Code
Occlusion-Aware Self-Supervised Monocular 6D Object Pose EstimationGu Wang, Fabian Manhardt, Xingyu Liu et al. · tsinghua
6D object pose estimation is a fundamental yet challenging problem in computer vision. Convolutional Neural Networks (CNNs) have recently proven to be capable of predicting reliable 6D pose estimates even under monocular settings. Nonetheless, CNNs are identified as being extremely data-driven, and acquiring adequate annotations is oftentimes very time-consuming and labor intensive. To overcome this limitation, we propose a novel monocular 6D pose estimation approach by means of self-supervised learning, removing the need for real annotations. After training our proposed network fully supervised with synthetic RGB data, we leverage current trends in noisy student training and differentiable rendering to further self-supervise the model on these unsupervised real RGB(-D) samples, seeking for a visually and geometrically optimal alignment. Moreover, employing both visible and amodal mask information, our self-supervision becomes very robust towards challenging scenarios such as occlusion. Extensive evaluations demonstrate that our proposed self-supervision outperforms all other methods relying on synthetic data or employing elaborate techniques from the domain adaptation realm. Noteworthy, our self-supervised approach consistently improves over its synthetically trained baseline and often almost closes the gap towards its fully supervised counterpart. The code and models are publicly available at https://github.com/THU-DA-6D-Pose-Group/self6dpp.git.
CVFeb 25, 2023
BOP Challenge 2022 on Detection, Segmentation and Pose Estimation of Specific Rigid ObjectsMartin Sundermeyer, Tomas Hodan, Yann Labbe et al. · tsinghua
We present the evaluation methodology, datasets and results of the BOP Challenge 2022, the fourth in a series of public competitions organized with the goal to capture the status quo in the field of 6D object pose estimation from an RGB/RGB-D image. In 2022, we witnessed another significant improvement in the pose estimation accuracy -- the state of the art, which was 56.9 AR$_C$ in 2019 (Vidal et al.) and 69.8 AR$_C$ in 2020 (CosyPose), moved to new heights of 83.7 AR$_C$ (GDRNPP). Out of 49 pose estimation methods evaluated since 2019, the top 18 are from 2022. Methods based on point pair features, which were introduced in 2010 and achieved competitive results even in 2020, are now clearly outperformed by deep learning methods. The synthetic-to-real domain gap was again significantly reduced, with 82.7 AR$_C$ achieved by GDRNPP trained only on synthetic images from BlenderProc. The fastest variant of GDRNPP reached 80.5 AR$_C$ with an average time per image of 0.23s. Since most of the recent methods for 6D object pose estimation begin by detecting/segmenting objects, we also started evaluating 2D object detection and segmentation performance based on the COCO metrics. Compared to the Mask R-CNN results from CosyPose in 2020, detection improved from 60.3 to 77.3 AP$_C$ and segmentation from 40.5 to 58.7 AP$_C$. The online evaluation system stays open and is available at: \href{http://bop.felk.cvut.cz/}{bop.felk.cvut.cz}.
CVSep 24, 2024Code
LaPose: Laplacian Mixture Shape Modeling for RGB-Based Category-Level Object Pose EstimationRuida Zhang, Ziqin Huang, Gu Wang et al. · tsinghua
While RGBD-based methods for category-level object pose estimation hold promise, their reliance on depth data limits their applicability in diverse scenarios. In response, recent efforts have turned to RGB-based methods; however, they face significant challenges stemming from the absence of depth information. On one hand, the lack of depth exacerbates the difficulty in handling intra-class shape variation, resulting in increased uncertainty in shape predictions. On the other hand, RGB-only inputs introduce inherent scale ambiguity, rendering the estimation of object size and translation an ill-posed problem. To tackle these challenges, we propose LaPose, a novel framework that models the object shape as the Laplacian mixture model for Pose estimation. By representing each point as a probabilistic distribution, we explicitly quantify the shape uncertainty. LaPose leverages both a generalized 3D information stream and a specialized feature stream to independently predict the Laplacian distribution for each point, capturing different aspects of object geometry. These two distributions are then integrated as a Laplacian mixture model to establish the 2D-3D correspondences, which are utilized to solve the pose via the PnP module. In order to mitigate scale ambiguity, we introduce a scale-agnostic representation for object size and translation, enhancing training efficiency and overall robustness. Extensive experiments on the NOCS datasets validate the effectiveness of LaPose, yielding state-of-the-art performance in RGB-based category-level object pose estimation. Codes are released at https://github.com/lolrudy/LaPose
CVJul 17, 2022
CATRE: Iterative Point Clouds Alignment for Category-level Object Pose RefinementXingyu Liu, Gu Wang, Yi Li et al. · tsinghua
While category-level 9DoF object pose estimation has emerged recently, previous correspondence-based or direct regression methods are both limited in accuracy due to the huge intra-category variances in object shape and color, etc. Orthogonal to them, this work presents a category-level object pose and size refiner CATRE, which is able to iteratively enhance pose estimate from point clouds to produce accurate results. Given an initial pose estimate, CATRE predicts a relative transformation between the initial pose and ground truth by means of aligning the partially observed point cloud and an abstract shape prior. In specific, we propose a novel disentangled architecture being aware of the inherent distinctions between rotation and translation/size estimation. Extensive experiments show that our approach remarkably outperforms state-of-the-art methods on REAL275, CAMERA25, and LM benchmarks up to a speed of ~85.32Hz, and achieves competitive results on category-level tracking. We further demonstrate that CATRE can perform pose refinement on unseen category. Code and trained models are available.
CVSep 25, 2024Code
FAFA: Frequency-Aware Flow-Aided Self-Supervision for Underwater Object Pose EstimationJingyi Tang, Gu Wang, Zeyu Chen et al. · tsinghua
Although methods for estimating the pose of objects in indoor scenes have achieved great success, the pose estimation of underwater objects remains challenging due to difficulties brought by the complex underwater environment, such as degraded illumination, blurring, and the substantial cost of obtaining real annotations. In response, we introduce FAFA, a Frequency-Aware Flow-Aided self-supervised framework for 6D pose estimation of unmanned underwater vehicles (UUVs). Essentially, we first train a frequency-aware flow-based pose estimator on synthetic data, where an FFT-based augmentation approach is proposed to facilitate the network in capturing domain-invariant features and target domain styles from a frequency perspective. Further, we perform self-supervised training by enforcing flow-aided multi-level consistencies to adapt it to the real-world underwater environment. Our framework relies solely on the 3D model and RGB images, alleviating the need for any real pose annotations or other-modality data like depths. We evaluate the effectiveness of FAFA on common underwater object pose benchmarks and showcase significant performance improvements compared to state-of-the-art methods. Code is available at github.com/tjy0703/FAFA.
CVOct 18, 2023
MOHO: Learning Single-view Hand-held Object Reconstruction with Multi-view Occlusion-Aware SupervisionChenyangguang Zhang, Guanlong Jiao, Yan Di et al. · tsinghua
Previous works concerning single-view hand-held object reconstruction typically rely on supervision from 3D ground-truth models, which are hard to collect in real world. In contrast, readily accessible hand-object videos offer a promising training data source, but they only give heavily occluded object observations. In this paper, we present a novel synthetic-to-real framework to exploit Multi-view Occlusion-aware supervision from hand-object videos for Hand-held Object reconstruction (MOHO) from a single image, tackling two predominant challenges in such setting: hand-induced occlusion and object's self-occlusion. First, in the synthetic pre-training stage, we render a large-scaled synthetic dataset SOMVideo with hand-object images and multi-view occlusion-free supervisions, adopted to address hand-induced occlusion in both 2D and 3D spaces. Second, in the real-world finetuning stage, MOHO leverages the amodal-mask-weighted geometric supervision to mitigate the unfaithful guidance caused by the hand-occluded supervising views in real world. Moreover, domain-consistent occlusion-aware features are amalgamated in MOHO to resist object's self-occlusion for inferring the complete object shape. Extensive experiments on HO3D and DexYCB datasets demonstrate 2D-supervised MOHO gains superior results against 3D-supervised methods by a large margin.
CVNov 23, 2023
D-SCo: Dual-Stream Conditional Diffusion for Monocular Hand-Held Object ReconstructionBowen Fu, Gu Wang, Chenyangguang Zhang et al. · tsinghua
Reconstructing hand-held objects from a single RGB image is a challenging task in computer vision. In contrast to prior works that utilize deterministic modeling paradigms, we employ a point cloud denoising diffusion model to account for the probabilistic nature of this problem. In the core, we introduce centroid-fixed dual-stream conditional diffusion for monocular hand-held object reconstruction (D-SCo), tackling two predominant challenges. First, to avoid the object centroid from deviating, we utilize a novel hand-constrained centroid fixing paradigm, enhancing the stability of diffusion and reverse processes and the precision of feature projection. Second, we introduce a dual-stream denoiser to semantically and geometrically model hand-object interactions with a novel unified hand-object semantic embedding, enhancing the reconstruction performance of the hand-occluded region of the object. Experiments on the synthetic ObMan dataset and three real-world datasets HO3D, MOW and DexYCB demonstrate that our approach can surpass all other state-of-the-art methods.
CVNov 25, 2024Code
UNOPose: Unseen Object Pose Estimation with an Unposed RGB-D Reference ImageXingyu Liu, Gu Wang, Ruida Zhang et al. · tsinghua
Unseen object pose estimation methods often rely on CAD models or multiple reference views, making the onboarding stage costly. To simplify reference acquisition, we aim to estimate the unseen object's pose through a single unposed RGB-D reference image. While previous works leverage reference images as pose anchors to limit the range of relative pose, our scenario presents significant challenges since the relative transformation could vary across the entire SE(3) space. Moreover, factors like occlusion, sensor noise, and extreme geometry could result in low viewpoint overlap. To address these challenges, we present a novel approach and benchmark, termed UNOPose, for unseen one-reference-based object pose estimation. Building upon a coarse-to-fine paradigm, UNOPose constructs an SE(3)-invariant reference frame to standardize object representation despite pose and size variations. To alleviate small overlap across viewpoints, we recalibrate the weight of each correspondence based on its predicted likelihood of being within the overlapping region. Evaluated on our proposed benchmark based on the BOP Challenge, UNOPose demonstrates superior performance, significantly outperforming traditional and learning-based methods in the one-reference setting and remaining competitive with CAD-model-based methods. The code and dataset are available at https://github.com/shanice-l/UNOPose.
CVMar 19, 2025Code
GIVEPose: Gradual Intra-class Variation Elimination for RGB-based Category-Level Object Pose EstimationZinqin Huang, Gu Wang, Chenyangguang Zhang et al. · tsinghua
Recent advances in RGBD-based category-level object pose estimation have been limited by their reliance on precise depth information, restricting their broader applicability. In response, RGB-based methods have been developed. Among these methods, geometry-guided pose regression that originated from instance-level tasks has demonstrated strong performance. However, we argue that the NOCS map is an inadequate intermediate representation for geometry-guided pose regression method, as its many-to-one correspondence with category-level pose introduces redundant instance-specific information, resulting in suboptimal results. This paper identifies the intra-class variation problem inherent in pose regression based solely on the NOCS map and proposes the Intra-class Variation-Free Consensus (IVFC) map, a novel coordinate representation generated from the category-level consensus model. By leveraging the complementary strengths of the NOCS map and the IVFC map, we introduce GIVEPose, a framework that implements Gradual Intra-class Variation Elimination for category-level object pose estimation. Extensive evaluations on both synthetic and real-world datasets demonstrate that GIVEPose significantly outperforms existing state-of-the-art RGB-based approaches, achieving substantial improvements in category-level object pose estimation. Our code is available at https://github.com/ziqin-h/GIVEPose.
16.9NAMar 18
State-dependent temperature control in Langevin diffusions using numerical exploratory Hamiltonian-Jacobi-Bellman equationsTaorui Wang, Xun Li, Gu Wang et al.
Choosing how much noise to add in Langevin dynamics is essential for making these algorithms effective in challenging optimization problems. One promising approach is to determine this noise by solving Hamilton-Jacobi-Bellman (HJB) equations and their exploratory variants. Though these ideas have been demonstrated to work well in one dimension, extension to high-dimensional minimization has been limited by two unresolved numerical challenges: setting reliable control bounds and stably computing the second-order information (Hessians) required by the equations. These issues and the broader impact of HJB parameters have not been systematically examined. This work provides the first such investigation. We introduce principled control bounds and develop a physics-informed neural network framework that embeds the structure of exploratory HJB equations directly into training, stabilizing computation, and enabling accurate estimation of state-dependent noise in high-dimensional problems. Numerical experiments demonstrate that the resulting method remains robust and effective well beyond low-dimensional test cases.
CVFeb 24, 2021Code
GDRNPP: A Geometry-guided and Fully Learning-based Object Pose EstimatorXingyu Liu, Ruida Zhang, Chenyangguang Zhang et al.
6D pose estimation of rigid objects is a long-standing and challenging task in computer vision. Recently, the emergence of deep learning reveals the potential of Convolutional Neural Networks (CNNs) to predict reliable 6D poses. Given that direct pose regression networks currently exhibit suboptimal performance, most methods still resort to traditional techniques to varying degrees. For example, top-performing methods often adopt an indirect strategy by first establishing 2D-3D or 3D-3D correspondences followed by applying the RANSAC-based PnP or Kabsch algorithms, and further employing ICP for refinement. Despite the performance enhancement, the integration of traditional techniques makes the networks time-consuming and not end-to-end trainable. Orthogonal to them, this paper introduces a fully learning-based object pose estimator. In this work, we first perform an in-depth investigation of both direct and indirect methods and propose a simple yet effective Geometry-guided Direct Regression Network (GDRN) to learn the 6D pose from monocular images in an end-to-end manner. Afterwards, we introduce a geometry-guided pose refinement module, enhancing pose accuracy when extra depth data is available. Guided by the predicted coordinate map, we build an end-to-end differentiable architecture that establishes robust and accurate 3D-3D correspondences between the observed and rendered RGB-D images to refine the pose. Our enhanced pose estimation pipeline GDRNPP (GDRN Plus Plus) conquered the leaderboard of the BOP Challenge for two consecutive years, becoming the first to surpass all prior methods that relied on traditional techniques in both accuracy and speed. The code and models are available at https://github.com/shanice-l/gdrnpp_bop2022.
CVJan 28
HINT: Hierarchical Interaction Modeling for Autoregressive Multi-Human Motion GenerationMengge Liu, Yan Di, Gu Wang et al.
Text-driven multi-human motion generation with complex interactions remains a challenging problem. Despite progress in performance, existing offline methods that generate fixed-length motions with a fixed number of agents, are inherently limited in handling long or variable text, and varying agent counts. These limitations naturally encourage autoregressive formulations, which predict future motions step by step conditioned on all past trajectories and current text guidance. In this work, we introduce HINT, the first autoregressive framework for multi-human motion generation with Hierarchical INTeraction modeling in diffusion. First, HINT leverages a disentangled motion representation within a canonicalized latent space, decoupling local motion semantics from inter-person interactions. This design facilitates direct adaptation to varying numbers of human participants without requiring additional refinement. Second, HINT adopts a sliding-window strategy for efficient online generation, and aggregates local within-window and global cross-window conditions to capture past human history, inter-person dependencies, and align with text guidance. This strategy not only enables fine-grained interaction modeling within each window but also preserves long-horizon coherence across all the long sequence. Extensive experiments on public benchmarks demonstrate that HINT matches the performance of strong offline models and surpasses autoregressive baselines. Notably, on InterHuman, HINT achieves an FID of 3.100, significantly improving over the previous state-of-the-art score of 5.154.
CVDec 2, 2024
GFreeDet: Exploiting Gaussian Splatting and Foundation Models for Model-free Unseen Object Detection in the BOP Challenge 2024Xingyu Liu, Gu Wang, Chengxi Li et al. · tsinghua
We present GFreeDet, an unseen object detection approach that leverages Gaussian splatting and vision Foundation models under model-free setting. Unlike existing methods that rely on predefined CAD templates, GFreeDet reconstructs objects directly from reference videos using Gaussian splatting, enabling robust detection of novel objects without prior 3D models. Evaluated on the BOP-H3 benchmark, GFreeDet achieves comparable performance to CAD-based methods, demonstrating the viability of model-free detection for mixed reality (MR) applications. Notably, GFreeDet won the best overall method and the best fast method awards in the model-free 2D detection track at BOP Challenge 2024.
CVApr 5, 2024
RaSim: A Range-aware High-fidelity RGB-D Data Simulation Pipeline for Real-world ApplicationsXingyu Liu, Chenyangguang Zhang, Gu Wang et al. · tsinghua
In robotic vision, a de-facto paradigm is to learn in simulated environments and then transfer to real-world applications, which poses an essential challenge in bridging the sim-to-real domain gap. While mainstream works tackle this problem in the RGB domain, we focus on depth data synthesis and develop a range-aware RGB-D data simulation pipeline (RaSim). In particular, high-fidelity depth data is generated by imitating the imaging principle of real-world sensors. A range-aware rendering strategy is further introduced to enrich data diversity. Extensive experiments show that models trained with RaSim can be directly applied to real-world scenarios without any finetuning and excel at downstream RGB-D perception tasks.
OCJun 9, 2025
Continuous Policy and Value Iteration for Stochastic Control Problems and Its ConvergenceQi Feng, Gu Wang
We introduce a continuous policy-value iteration algorithm where the approximations of the value function of a stochastic control problem and the optimal control are simultaneously updated through Langevin-type dynamics. This framework applies to both the entropy-regularized relaxed control problems and the classical control problems, with infinite horizon. We establish policy improvement and demonstrate convergence to the optimal control under the monotonicity condition of the Hamiltonian. By utilizing Langevin-type stochastic differential equations for continuous updates along the policy iteration direction, our approach enables the use of distribution sampling and non-convex learning techniques in machine learning to optimize the value function and identify the optimal control simultaneously.
CVMar 14, 2024
BOP Challenge 2023 on Detection, Segmentation and Pose Estimation of Seen and Unseen Rigid ObjectsTomas Hodan, Martin Sundermeyer, Yann Labbe et al.
We present the evaluation methodology, datasets and results of the BOP Challenge 2023, the fifth in a series of public competitions organized to capture the state of the art in model-based 6D object pose estimation from an RGB/RGB-D image and related tasks. Besides the three tasks from 2022 (model-based 2D detection, 2D segmentation, and 6D localization of objects seen during training), the 2023 challenge introduced new variants of these tasks focused on objects unseen during training. In the new tasks, methods were required to learn new objects during a short onboarding stage (max 5 minutes, 1 GPU) from provided 3D object models. The best 2023 method for 6D localization of unseen objects (GenFlow) notably reached the accuracy of the best 2020 method for seen objects (CosyPose), although being noticeably slower. The best 2023 method for seen objects (GPose) achieved a moderate accuracy improvement but a significant 43% run-time improvement compared to the best 2022 counterpart (GDRNPP). Since 2017, the accuracy of 6D localization of seen objects has improved by more than 50% (from 56.9 to 85.6 AR_C). The online evaluation system stays open and is available at: http://bop.felk.cvut.cz/.
CVAug 18, 2021
SO-Pose: Exploiting Self-Occlusion for Direct 6D Pose EstimationYan Di, Fabian Manhardt, Gu Wang et al.
Directly regressing all 6 degrees-of-freedom (6DoF) for the object pose (e.g. the 3D rotation and translation) in a cluttered environment from a single RGB image is a challenging problem. While end-to-end methods have recently demonstrated promising results at high efficiency, they are still inferior when compared with elaborate P$n$P/RANSAC-based approaches in terms of pose accuracy. In this work, we address this shortcoming by means of a novel reasoning about self-occlusion, in order to establish a two-layer representation for 3D objects which considerably enhances the accuracy of end-to-end 6D pose estimation. Our framework, named SO-Pose, takes a single RGB image as input and respectively generates 2D-3D correspondences as well as self-occlusion information harnessing a shared encoder and two separate decoders. Both outputs are then fused to directly regress the 6DoF pose parameters. Incorporating cross-layer consistencies that align correspondences, self-occlusion and 6D pose, we can further improve accuracy and robustness, surpassing or rivaling all other state-of-the-art approaches on various challenging datasets.
CVFeb 24, 2021
PFRL: Pose-Free Reinforcement Learning for 6D Pose EstimationJianzhun Shao, Yuhang Jiang, Gu Wang et al.
6D pose estimation from a single RGB image is a challenging and vital task in computer vision. The current mainstream deep model methods resort to 2D images annotated with real-world ground-truth 6D object poses, whose collection is fairly cumbersome and expensive, even unavailable in many cases. In this work, to get rid of the burden of 6D annotations, we formulate the 6D pose refinement as a Markov Decision Process and impose on the reinforcement learning approach with only 2D image annotations as weakly-supervised 6D pose information, via a delicate reward definition and a composite reinforced optimization method for efficient and effective policy training. Experiments on LINEMOD and T-LESS datasets demonstrate that our Pose-Free approach is able to achieve state-of-the-art performance compared with the methods without using real-world ground-truth 6D pose labels.
CVApr 14, 2020
Self6D: Self-Supervised Monocular 6D Object Pose EstimationGu Wang, Fabian Manhardt, Jianzhun Shao et al.
6D object pose estimation is a fundamental problem in computer vision. Convolutional Neural Networks (CNNs) have recently proven to be capable of predicting reliable 6D pose estimates even from monocular images. Nonetheless, CNNs are identified as being extremely data-driven, and acquiring adequate annotations is oftentimes very time-consuming and labor intensive. To overcome this shortcoming, we propose the idea of monocular 6D pose estimation by means of self-supervised learning, removing the need for real annotations. After training our proposed network fully supervised with synthetic RGB data, we leverage recent advances in neural rendering to further self-supervise the model on unannotated real RGB-D data, seeking for a visually and geometrically optimal alignment. Extensive evaluations demonstrate that our proposed self-supervision is able to significantly enhance the model's original performance, outperforming all other methods relying on synthetic data or employing elaborate techniques from the domain adaptation realm.
CVMar 12, 2020
CPS++: Improving Class-level 6D Pose and Shape Estimation From Monocular Images With Self-Supervised LearningFabian Manhardt, Gu Wang, Benjamin Busam et al.
Contemporary monocular 6D pose estimation methods can only cope with a handful of object instances. This naturally hampers possible applications as, for instance, robots seamlessly integrated in everyday processes necessarily require the ability to work with hundreds of different objects. To tackle this problem of immanent practical relevance, we propose a novel method for class-level monocular 6D pose estimation, coupled with metric shape retrieval. Unfortunately, acquiring adequate annotations is very time-consuming and labor intensive. This is especially true for class-level 6D pose estimation, as one is required to create a highly detailed reconstruction for all objects and then annotate each object and scene using these models. To overcome this shortcoming, we additionally propose the idea of synthetic-to-real domain transfer for class-level 6D poses by means of self-supervised learning, which removes the burden of collecting numerous manual annotations. In essence, after training our proposed method fully supervised with synthetic data, we leverage recent advances in differentiable rendering to self-supervise the model with unannotated real RGB-D data to improve latter inference. We experimentally demonstrate that we can retrieve precise 6D poses and metric shapes from a single RGB image.
CVMar 31, 2018
DeepIM: Deep Iterative Matching for 6D Pose EstimationYi Li, Gu Wang, Xiangyang Ji et al.
Estimating the 6D pose of objects from images is an important problem in various applications such as robot manipulation and virtual reality. While direct regression of images to object poses has limited accuracy, matching rendered images of an object against the observed image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimation, our network is able to iteratively refine the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using an untangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over state-of-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects.
CVJul 9, 2016
Action Recognition with Joint Attention on Multi-Level Deep FeaturesJialin Wu, Gu Wang, Wukui Yang et al.
We propose a novel deep supervised neural network for the task of action recognition in videos, which implicitly takes advantage of visual tracking and shares the robustness of both deep Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). In our method, a multi-branch model is proposed to suppress noise from background jitters. Specifically, we firstly extract multi-level deep features from deep CNNs and feed them into 3d-convolutional network. After that we feed those feature cubes into our novel joint LSTM module to predict labels and to generate attention regularization. We evaluate our model on two challenging datasets: UCF101 and HMDB51. The results show that our model achieves the state-of-art by only using convolutional features.