Zheming Zhou

RO
h-index20
10papers
120citations
Novelty57%
AI Score31

10 Papers

CVJul 29, 2024Code
Correspondence-Free SE(3) Point Cloud Registration in RKHS via Unsupervised Equivariant Learning

Ray Zhang, Zheming Zhou, Min Sun et al.

This paper introduces a robust unsupervised SE(3) point cloud registration method that operates without requiring point correspondences. The method frames point clouds as functions in a reproducing kernel Hilbert space (RKHS), leveraging SE(3)-equivariant features for direct feature space registration. A novel RKHS distance metric is proposed, offering reliable performance amidst noise, outliers, and asymmetrical data. An unsupervised training approach is introduced to effectively handle limited ground truth data, facilitating adaptation to real datasets. The proposed method outperforms classical and supervised methods in terms of registration accuracy on both synthetic (ModelNet40) and real-world (ETH3D) noisy, outlier-rich datasets. To our best knowledge, this marks the first instance of successful real RGB-D odometry data registration using an equivariant method. The code is available at {https://sites.google.com/view/eccv24-equivalign}

CVJul 24, 2024
CSCPR: Cross-Source-Context Indoor RGB-D Place Recognition

Jing Liang, Zhuo Deng, Zheming Zhou et al.

We extend our previous work, PoCo, and present a new algorithm, Cross-Source-Context Place Recognition (CSCPR), for RGB-D indoor place recognition that integrates global retrieval and reranking into an end-to-end model and keeps the consistency of using Context-of-Clusters (CoCs) for feature processing. Unlike prior approaches that primarily focus on the RGB domain for place recognition reranking, CSCPR is designed to handle the RGB-D data. We apply the CoCs to handle cross-sourced and cross-scaled RGB-D point clouds and introduce two novel modules for reranking: the Self-Context Cluster (SCC) and the Cross Source Context Cluster (CSCC), which enhance feature representation and match query-database pairs based on local features, respectively. We also release two new datasets, ScanNetIPR and ARKitIPR. Our experiments demonstrate that CSCPR significantly outperforms state-of-the-art models on these datasets by at least 29.27% in Recall@1 on the ScanNet-PR dataset and 43.24% in the new datasets. Code and datasets will be released.

ROOct 15, 2023
Tabletop Transparent Scene Reconstruction via Epipolar-Guided Optical Flow with Monocular Depth Completion Prior

Xiaotong Chen, Zheming Zhou, Zhuo Deng et al.

Reconstructing transparent objects using affordable RGB-D cameras is a persistent challenge in robotic perception due to inconsistent appearances across views in the RGB domain and inaccurate depth readings in each single-view. We introduce a two-stage pipeline for reconstructing transparent objects tailored for mobile platforms. In the first stage, off-the-shelf monocular object segmentation and depth completion networks are leveraged to predict the depth of transparent objects, furnishing single-view shape prior. Subsequently, we propose Epipolar-guided Optical Flow (EOF) to fuse several single-view shape priors from the first stage to a cross-view consistent 3D reconstruction given camera poses estimated from opaque part of the scene. Our key innovation lies in EOF which employs boundary-sensitive sampling and epipolar-line constraints into optical flow to accurately establish 2D correspondences across multiple views on transparent objects. Quantitative evaluations demonstrate that our pipeline significantly outperforms baseline methods in 3D reconstruction quality, paving the way for more adept robotic perception and interaction with transparent objects.

CVMay 21, 2024
The Power of Next-Frame Prediction for Learning Physical Laws

Thomas Winterbottom, G. Thomas Hudson, Daniel Kluvanec et al.

Next-frame prediction is a useful and powerful method for modelling and understanding the dynamics of video data. Inspired by the empirical success of causal language modelling and next-token prediction in language modelling, we explore the extent to which next-frame prediction serves as a strong foundational learning strategy (analogous to language modelling) for inducing an understanding of the visual world. In order to quantify the specific visual understanding induced by next-frame prediction, we introduce six diagnostic simulation video datasets derived from fundamental physical laws created by varying physical constants such as gravity and mass. We demonstrate that our models trained only on next-frame prediction are capable of predicting the value of these physical constants (e.g. gravity) without having been trained directly to learn these constants via a regression task. We find that the generative training phase alone induces a model state that can predict physical constants significantly better than that of a random model, improving the loss by a factor of between 1.28 to 6.24. We conclude that next-frame prediction shows great promise as a general learning strategy to induce understanding of the many `laws' that govern the visual domain without the need for explicit labelling.

CVApr 3, 2024
PoCo: Point Context Cluster for RGBD Indoor Place Recognition

Jing Liang, Zhuo Deng, Zheming Zhou et al.

We present a novel end-to-end algorithm (PoCo) for the indoor RGB-D place recognition task, aimed at identifying the most likely match for a given query frame within a reference database. The task presents inherent challenges attributed to the constrained field of view and limited range of perception sensors. We propose a new network architecture, which generalizes the recent Context of Clusters (CoCs) to extract global descriptors directly from the noisy point clouds through end-to-end learning. Moreover, we develop the architecture by integrating both color and geometric modalities into the point features to enhance the global descriptor representation. We conducted evaluations on public datasets ScanNet-PR and ARKit with 807 and 5047 scenarios, respectively. PoCo achieves SOTA performance: on ScanNet-PR, we achieve R@1 of 64.63%, a 5.7% improvement from the best-published result CGis (61.12%); on Arkit, we achieve R@1 of 45.12%, a 13.3% improvement from the best-published result CGis (39.82%). In addition, PoCo shows higher efficiency than CGis in inference time (1.75X-faster), and we demonstrate the effectiveness of PoCo in recognizing places within a real-world laboratory environment.

ROOct 2, 2019
LIT: Light-field Inference of Transparency for Refractive Object Localization

Zheming Zhou, Xiaotong Chen, Odest Chadwicke Jenkins

Translucency is prevalent in everyday scenes. As such, perception of transparent objects is essential for robots to perform manipulation. Compared with texture-rich or texture-less Lambertian objects, transparency induces significant uncertainty on object appearances. Ambiguity can be due to changes in lighting, viewpoint, and backgrounds, each of which brings challenges to existing object pose estimation algorithms. In this work, we propose LIT, a two-stage method for transparent object pose estimation using light-field sensing and photorealistic rendering. LIT employs multiple filters specific to light-field imagery in deep networks to capture transparent material properties, with robust depth and pose estimators based on generative sampling. Along with the LIT algorithm, we introduce the light-field transparent object dataset ProLIT for the tasks of recognition, localization and pose estimation. With respect to this ProLIT dataset, we demonstrate that LIT can outperform both state-of-the-art end-to-end pose estimation methods and a generative pose estimator on transparent objects.

ROSep 10, 2019
GlassLoc: Plenoptic Grasp Pose Detection in Transparent Clutter

Zheming Zhou, Tianyang Pan, Shiyu Wu et al.

Transparent objects are prevalent across many environments of interest for dexterous robotic manipulation. Such transparent material leads to considerable uncertainty for robot perception and manipulation, and remains an open challenge for robotics. This problem is exacerbated when multiple transparent objects cluster into piles of clutter. In household environments, for example, it is common to encounter piles of glassware in kitchens, dining rooms, and reception areas, which are essentially invisible to modern robots. We present the GlassLoc algorithm for grasp pose detection of transparent objects in transparent clutter using plenoptic sensing. GlassLoc classifies graspable locations in space informed by a Depth Likelihood Volume (DLV) descriptor. We extend the DLV to infer the occupancy of transparent objects over a given space from multiple plenoptic viewpoints. We demonstrate and evaluate the GlassLoc algorithm on a Michigan Progress Fetch mounted with a first-generation Lytro. The effectiveness of our algorithm is evaluated through experiments for grasp detection and execution with a variety of transparent glassware in minor clutter.

ROJun 26, 2018
Plenoptic Monte Carlo Object Localization for Robot Grasping under Layered Translucency

Zheming Zhou, Zhiqiang Sui, Odest Chadwicke Jenkins

In order to fully function in human environments, robot perception will need to account for the uncertainty caused by translucent materials. Translucency poses several open challenges in the form of transparent objects (e.g., drinking glasses), refractive media (e.g., water), and diffuse partial occlusions (e.g., objects behind stained glass panels). This paper presents Plenoptic Monte Carlo Localization (PMCL) as a method for localizing object poses in the presence of translucency using plenoptic (light-field) observations. We propose a new depth descriptor, the Depth Likelihood Volume (DLV), and its use within a Monte Carlo object localization algorithm. We present results of localizing and manipulating objects with translucent materials and objects occluded by layers of translucency. Our PMCL implementation uses observations from a Lytro first generation light field camera to allow a Michigan Progress Fetch robot to perform grasping.

ROApr 4, 2017
Semantic Robot Programming for Goal-Directed Manipulation in Cluttered Scenes

Zhen Zeng, Zheming Zhou, Zhiqiang Sui et al.

We present the Semantic Robot Programming (SRP) paradigm as a convergence of robot programming by demonstration and semantic mapping. In SRP, a user can directly program a robot manipulator by demonstrating a snapshot of their intended goal scene in workspace. The robot then parses this goal as a scene graph comprised of object poses and inter-object relations, assuming known object geometries. Task and motion planning is then used to realize the user's goal from an arbitrary initial scene configuration. Even when faced with different initial scene configurations, SRP enables the robot to seamlessly adapt to reach the user's demonstrated goal. For scene perception, we propose the Discriminatively-Informed Generative Estimation of Scenes and Transforms (DIGEST) method to infer the initial and goal states of the world from RGBD images. The efficacy of SRP with DIGEST perception is demonstrated for the task of tray-setting with a Michigan Progress Fetch robot. Scene perception and task execution are evaluated with a public household occlusion dataset and our cluttered scene dataset.

ROMar 22, 2017
SUM: Sequential Scene Understanding and Manipulation

Zhiqiang Sui, Zheming Zhou, Zhen Zeng et al.

In order to perform autonomous sequential manipulation tasks, perception in cluttered scenes remains a critical challenge for robots. In this paper, we propose a probabilistic approach for robust sequential scene estimation and manipulation - Sequential Scene Understanding and Manipulation(SUM). SUM considers uncertainty due to discriminative object detection and recognition in the generative estimation of the most likely object poses maintained over time to achieve a robust estimation of the scene under heavy occlusions and unstructured environment. Our method utilizes candidates from discriminative object detector and recognizer to guide the generative process of sampling scene hypothesis, and each scene hypotheses is evaluated against the observations. Also SUM maintains beliefs of scene hypothesis over robot physical actions for better estimation and against noisy detections. We conduct extensive experiments to show that our approach is able to perform robust estimation and manipulation.