CVAug 17, 2023
MIPS-Fusion: Multi-Implicit-Submaps for Scalable and Robust Online Neural RGB-D ReconstructionYijie Tang, Jiazhao Zhang, Zhinan Yu et al.
We introduce MIPS-Fusion, a robust and scalable online RGB-D reconstruction method based on a novel neural implicit representation -- multi-implicit-submap. Different from existing neural RGB-D reconstruction methods lacking either flexibility with a single neural map or scalability due to extra storage of feature grids, we propose a pure neural representation tackling both difficulties with a divide-and-conquer design. In our method, neural submaps are incrementally allocated alongside the scanning trajectory and efficiently learned with local neural bundle adjustments. The submaps can be refined individually in a back-end optimization and optimized jointly to realize submap-level loop closure. Meanwhile, we propose a hybrid tracking approach combining randomized and gradient-based pose optimizations. For the first time, randomized optimization is made possible in neural tracking with several key designs to the learning process, enabling efficient and robust tracking even under fast camera motions. The extensive evaluation demonstrates that our method attains higher reconstruction quality than the state of the arts for large-scale scenes and under fast camera motions.
ROAug 22, 2024
LLM-enhanced Scene Graph Learning for Household RearrangementWenhao Li, Zhiyuan Yu, Qijin She et al.
The household rearrangement task involves spotting misplaced objects in a scene and accommodate them with proper places. It depends both on common-sense knowledge on the objective side and human user preference on the subjective side. In achieving such task, we propose to mine object functionality with user preference alignment directly from the scene itself, without relying on human intervention. To do so, we work with scene graph representation and propose LLM-enhanced scene graph learning which transforms the input scene graph into an affordance-enhanced graph (AEG) with information-enhanced nodes and newly discovered edges (relations). In AEG, the nodes corresponding to the receptacle objects are augmented with context-induced affordance which encodes what kind of carriable objects can be placed on it. New edges are discovered with newly discovered non-local relations. With AEG, we perform task planning for scene rearrangement by detecting misplaced carriables and determining a proper placement for each of them. We test our method by implementing a tiding robot in simulator and perform evaluation on a new benchmark we build. Extensive evaluations demonstrate that our method achieves state-of-the-art performance on misplacement detection and the following rearrangement planning.
CVMay 1, 2024Code
F2M-Reg: Unsupervised RGB-D Point Cloud Registration with Frame-to-Model OptimizationZhinan Yu, Zheng Qin, Yijie Tang et al.
This work studies the problem of unsupervised RGB-D point cloud registration, which aims at training a robust registration model without ground-truth pose supervision. Existing methods usually leverages unposed RGB-D sequences and adopt a frame-to-frame framework based on differentiable rendering to train the registration model, which enforces the photometric and geometric consistency between the two frames for supervision. However, this frame-to-frame framework is vulnerable to inconsistent factors between different frames, e.g., lighting changes, geometry occlusion, and reflective materials, which leads to suboptimal convergence of the registration model. In this paper, we propose a novel frame-to-model optimization framework named F2M-Reg for unsupervised RGB-D point cloud registration. We leverage the neural implicit field as a global model of the scene and optimize the estimated poses of the frames by registering them to the global model, and the registration model is subsequently trained with the optimized poses. Thanks to the global encoding capability of neural implicit field, our frame-to-model framework is significantly more robust to inconsistent factors between different frames and thus can provide better supervision for the registration model. Besides, we demonstrate that F2M-Reg can be further enhanced by a simplistic synthetic warming-up strategy. To this end, we construct a photorealistic synthetic dataset named Sim-RGBD to initialize the registration model for the frame-to-model optimization on real-world RGB-D sequences. Extensive experiments on four challenging benchmarks have shown that our method surpasses the previous state-of-the-art counterparts by a large margin, especially under scenarios with severe lighting changes and low overlap. Our code and models are available at https://github.com/MrIsland/F2M_Reg.
CVDec 11, 2024
CogNav: Cognitive Process Modeling for Object Goal Navigation with LLMsYihan Cao, Jiazhao Zhang, Zhinan Yu et al.
Object goal navigation (ObjectNav) is a fundamental task in embodied AI, requiring an agent to locate a target object in previously unseen environments. This task is particularly challenging because it requires both perceptual and cognitive processes, including object recognition and decision-making. While substantial advancements in perception have been driven by the rapid development of visual foundation models, progress on the cognitive aspect remains constrained, primarily limited to either implicit learning through simulator rollouts or explicit reliance on predefined heuristic rules. Inspired by neuroscientific findings demonstrating that humans maintain and dynamically update fine-grained cognitive states during object search tasks in novel environments, we propose CogNav, a framework designed to mimic this cognitive process using large language models. Specifically, we model the cognitive process using a finite state machine comprising fine-grained cognitive states, ranging from exploration to identification. Transitions between states are determined by a large language model based on a dynamically constructed heterogeneous cognitive map, which contains spatial and semantic information about the scene being explored. Extensive evaluations on the HM3D, MP3D, and RoboTHOR benchmarks demonstrate that our cognitive process modeling significantly improves the success rate of ObjectNav at least by relative 14% over the state-of-the-arts.
CVDec 11, 2024
Neural Observation Field Guided Hybrid Optimization of Camera PlacementYihan Cao, Jiazhao Zhang, Zhinan Yu et al.
Camera placement is crutial in multi-camera systems such as virtual reality, autonomous driving, and high-quality reconstruction. The camera placement challenge lies in the nonlinear nature of high-dimensional parameters and the unavailability of gradients for target functions like coverage and visibility. Consequently, most existing methods tackle this challenge by leveraging non-gradient-based optimization methods.In this work, we present a hybrid camera placement optimization approach that incorporates both gradient-based and non-gradient-based optimization methods. This design allows our method to enjoy the advantages of smooth optimization convergence and robustness from gradient-based and non-gradient-based optimization, respectively. To bridge the two disparate optimization methods, we propose a neural observation field, which implicitly encodes the coverage and observation quality. The neural observation field provides the measurements of the camera observations and corresponding gradients without the assumption of target scenes, making our method applicable to diverse scenarios, including 2D planar shapes, 3D objects, and room-scale 3D scenes.Extensive experiments on diverse datasets demonstrate that our method achieves state-of-the-art performance, while requiring only a fraction (8x less) of the typical computation time. Furthermore, we conducted a real-world experiment using a custom-built capture system, confirming the resilience of our approach to real-world environmental noise.