Hongbin Zha

CV
h-index 13
36 papers
2,250 citations
Novelty 55%
AI Score 58

36 Papers

CV Oct 13, 2023 Code
TIDE: Temporally Incremental Disparity Estimation via Pattern Flow in Structured Light System

Rukun Qiao, Hiroshi Kawasaki, Hongbin Zha

We introduce the Temporally Incremental Disparity Estimation Network (TIDE-Net), a learning-based technique for disparity computation in mono-camera structured light systems. In our hardware setting, a static pattern is projected onto a dynamic scene and captured by a monocular camera. Unlike most previous disparity estimation methods, which operate in a frame-wise manner, our network acquires disparity maps in a temporally incremental way. Specifically, we exploit the deformation of projected patterns (named pattern flow) on captured image sequences to model the temporal information. Notably, this newly proposed pattern flow formulation reflects the disparity changes along the epipolar line, which is a special form of optical flow. Tailored for pattern flow, the TIDE-Net, a recurrent architecture, is proposed and implemented. For each incoming frame, our model fuses correlation volumes (from the current frame) and disparity (from the former frame) warped by pattern flow. From the fused features, the final stage of TIDE-Net estimates the residual disparity rather than the full disparity that many previous methods estimate. Interestingly, this design brings clear empirical advantages in terms of efficiency and generalization ability. Using only synthetic data for training, our extensive evaluation results (w.r.t. both accuracy and efficiency metrics) show superior performance over several SOTA models on unseen real data. The code is available at https://github.com/CodePointer/TIDENet.

CV Aug 6, 2023 Code
ECT: Fine-grained Edge Detection with Learned Cause Tokens

Shaocong Xu, Xiaoxue Chen, Yuhang Zheng et al.

In this study, we tackle the challenging fine-grained edge detection task, which refers to predicting specific edges caused by reflectance, illumination, normal, and depth changes, respectively. Prior methods exploit multi-scale convolutional networks, which are limited in three aspects: (1) Convolutions are local operators, while identifying the cause of edge formation requires looking at faraway pixels. (2) Priors specific to edge cause are fixed in prediction heads. (3) Separate networks are used for generic and fine-grained edge detection, so the constraint between them may be violated. To address these three issues, we propose a two-stage transformer-based network sequentially predicting generic edges and fine-grained edges, which has a global receptive field thanks to the attention mechanism. The prior knowledge of edge causes is formulated as four learnable cause tokens in a cause-aware decoder design. Furthermore, to encourage consistency between generic edges and fine-grained edges, an edge aggregation and alignment loss is exploited. We evaluate our method on the public benchmark BSDS-RIND and several newly derived benchmarks, and achieve new state-of-the-art results. Our code, data, and models are publicly available at https://github.com/Daniellli/ECT.git.

CV Aug 30, 2023
Active Neural Mapping

Zike Yan, Haoxiang Yang, Hongbin Zha

We address the problem of active mapping with a continually-learned neural scene representation, namely Active Neural Mapping. The key lies in actively finding the target space to be explored with efficient agent movement, thus minimizing the map uncertainty on-the-fly within a previously unseen environment. In this paper, we examine the weight space of the continually-learned neural field, and show empirically that the neural variability, the prediction robustness against random weight perturbation, can be directly utilized to measure the instant uncertainty of the neural map. Together with the continuous geometric information inherent in the neural map, the agent can be guided to find a traversable path to gradually gain knowledge of the environment. We present for the first time an active mapping system with a coordinate-based implicit neural representation for online scene reconstruction. Experiments in the visually-realistic Gibson and Matterport3D environments demonstrate the efficacy of the proposed method.
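
As a rough illustration of the "neural variability" idea in the abstract above, the sketch below perturbs the weights of a toy coordinate network with Gaussian noise and takes the variance of its predictions at a query point as an uncertainty proxy. The two-layer network, the noise scale `sigma`, and the sample count are illustrative assumptions and not the paper's actual configuration.

```python
import math
import random
import statistics

def mlp(w1, w2, x):
    """Toy coordinate network: 3D point -> scalar (assumed architecture)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return sum(w * h for w, h in zip(w2, hidden))

def neural_variability(w1, w2, x, sigma=0.01, n_samples=16, seed=0):
    """Variance of the prediction at x under random weight perturbation."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_samples):
        # Add zero-mean Gaussian noise to every weight (hypothetical scheme).
        noisy_w1 = [[w + rng.gauss(0.0, sigma) for w in row] for row in w1]
        noisy_w2 = [w + rng.gauss(0.0, sigma) for w in w2]
        preds.append(mlp(noisy_w1, noisy_w2, x))
    return statistics.pvariance(preds)

# Example: a fixed toy network queried at one 3D point.
init = random.Random(42)
w1 = [[init.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(8)]
w2 = [init.uniform(-1.0, 1.0) for _ in range(8)]
score = neural_variability(w1, w2, [0.2, -0.5, 0.1], sigma=0.1)
```

Under this reading, a larger `score` marks a region where the prediction is less robust, which an agent could treat as a candidate exploration target.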

CV Jan 31, 2023
From Semi-supervised to Omni-supervised Room Layout Estimation Using Point Clouds

Huan-ang Gao, Beiwen Tian, Pengfei Li et al.

Room layout estimation is a long-standing robotic vision task that benefits both environment sensing and motion planning. However, layout estimation using point clouds (PCs) still suffers from data scarcity due to annotation difficulty. As such, we address the semi-supervised setting of this task based upon the idea of model exponential moving averaging. But adapting this scheme to the state-of-the-art (SOTA) solution for PC-based layout estimation is not straightforward. To this end, we define a quad set matching strategy and several consistency losses based upon metrics tailored for layout quads. In addition, we propose a new online pseudo-label harvesting algorithm that decomposes the distribution of a hybrid distance measure between quads and PC into two components. This technique does not need manual threshold selection and intuitively encourages quads to align with reliable layout points. Surprisingly, this framework also works for the fully-supervised setting, achieving a new SOTA on the ScanNet benchmark. Last but not least, we also push the semi-supervised setting to the realistic omni-supervised setting, demonstrating significantly improved performance on a newly annotated ARKitScenes testing set. Our codes, data and models are released in this repository.
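
The "model exponential moving averaging" mentioned in the abstract above is commonly realized as a teacher network whose weights track a running average of the student's weights. A minimal sketch of that generic update follows; the dictionary-of-floats parameter format and the `decay` value are assumptions for illustration, not details taken from the paper.

```python
def ema_update(teacher, student, decay=0.99):
    """Return teacher weights moved toward the student by factor (1 - decay)."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }

# Example: the teacher drifts slowly toward a student that keeps changing.
teacher = {"w": 1.0, "b": 0.0}
for step in range(3):
    student = {"w": 0.0, "b": float(step)}  # stand-in for a trained student
    teacher = ema_update(teacher, student, decay=0.9)
```

A decay close to 1 makes the teacher change slowly, which is why its predictions are often used as the more stable source of pseudo-labels in semi-supervised training.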

CV Oct 23, 2022
SC-wLS: Towards Interpretable Feed-forward Camera Re-localization

Xin Wu, Hao Zhao, Shunkai Li et al.

Visual re-localization aims to recover camera poses in a known environment, which is vital for applications like robotics or augmented reality. Feed-forward absolute camera pose regression methods directly output poses by a network, but suffer from low accuracy. Meanwhile, scene coordinate based methods are accurate, but need iterative RANSAC post-processing, which brings challenges to efficient end-to-end training and inference. In order to have the best of both worlds, we propose a feed-forward method termed SC-wLS that exploits all scene coordinate estimates for weighted least squares pose regression. This differentiable formulation exploits a weight network imposed on 2D-3D correspondences, and requires pose supervision only. Qualitative results demonstrate the interpretability of learned weights. Evaluations on the 7Scenes and Cambridge datasets show significantly improved performance compared with former feed-forward counterparts. Moreover, our SC-wLS method enables a new capability: self-supervised test-time adaptation on the weight network. Codes and models are publicly available.

CV Mar 12 Code
MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation

Baicheng Li, Dong Wu, Jun Li et al.

Recent unified 3D generation models have made remarkable progress in producing high-quality 3D assets from a single image. Notably, layout-aware approaches such as SAM3D can reconstruct multiple objects while preserving their spatial arrangement, opening the door to practical scene-level 3D generation. However, current methods are limited to single-view input and cannot leverage complementary multi-view observations, while independently estimated object poses often lead to physically implausible layouts such as interpenetration and floating artifacts. We present MV-SAM3D, a training-free framework that extends layout-aware 3D generation with multi-view consistency and physical plausibility. We formulate multi-view fusion as a Multi-Diffusion process in 3D latent space and propose two adaptive weighting strategies -- attention-entropy weighting and visibility weighting -- that enable confidence-aware fusion, ensuring each viewpoint contributes according to its local observation reliability. For multi-object composition, we introduce physics-aware optimization that injects collision and contact constraints both during and after generation, yielding physically plausible object arrangements. Experiments on standard benchmarks and real-world multi-object scenes demonstrate significant improvements in reconstruction fidelity and layout plausibility, all without any additional training. Code is available at https://github.com/devinli123/MV-SAM3D.

CV Sep 30, 2024
Active Neural Mapping at Scale

Zijia Kuang, Zike Yan, Hao Zhao et al.

We introduce a NeRF-based active mapping system that enables efficient and robust exploration of large-scale indoor environments. The key to our approach is the extraction of a generalized Voronoi graph (GVG) from the continually updated neural map, leading to the synergistic integration of scene geometry, appearance, topology, and uncertainty. Anchoring uncertain areas induced by the neural map to the vertices of GVG allows the exploration to undergo adaptive granularity along a safe path that traverses unknown areas efficiently. Harnessing a modern hybrid NeRF representation, the proposed system achieves competitive results in terms of reconstruction accuracy, coverage completeness, and exploration efficiency even when scaling up to large indoor environments. Extensive results at different scales validate the efficacy of the proposed system.

RO Apr 9
Reflection-Based Task Adaptation for Self-Improving VLA

Baicheng Li, Dong Wu, Zike Yan et al.

Pre-trained Vision-Language-Action (VLA) models represent a major leap towards general-purpose robots, yet efficiently adapting them to novel, specific tasks in-situ remains a significant hurdle. While reinforcement learning (RL) is a promising avenue for such adaptation, the process often suffers from low efficiency, hindering rapid task mastery. We introduce Reflective Self-Adaptation, a framework for rapid, autonomous task adaptation without human intervention. Our framework establishes a self-improving loop where the agent learns from its own experience to enhance both strategy and execution. The core of our framework is a dual-pathway architecture that addresses the full adaptation lifecycle. First, a Failure-Driven Reflective RL pathway enables rapid learning by using the VLM's causal reasoning to automatically synthesize a targeted, dense reward function from failure analysis. This provides a focused learning signal that significantly accelerates policy exploration. However, optimizing such proxy rewards introduces a potential risk of "reward hacking," where the agent masters the reward function but fails the actual task. To counteract this, our second pathway, Success-Driven Quality-Guided SFT, grounds the policy in holistic success. It identifies and selectively imitates high-quality successful trajectories, ensuring the agent remains aligned with the ultimate task goal. This pathway is strengthened by a conditional curriculum mechanism to aid initial exploration. We conduct experiments in challenging manipulation tasks. The results demonstrate that our framework achieves faster convergence and higher final success rates compared to representative baselines. Our work presents a robust solution for creating self-improving agents that can efficiently and reliably adapt to new environments.

CV Jul 18, 2024
Learn to Memorize and to Forget: A Continual Learning Perspective of Dynamic SLAM

Baicheng Li, Zike Yan, Dong Wu et al.

Simultaneous localization and mapping (SLAM) with implicit neural representations has received extensive attention due to the expressive representation power and the innovative paradigm of continual learning. However, deploying such a system within a dynamic environment has not been well-studied. Such challenges are intractable even for conventional algorithms, since observations from different views with dynamic objects involved break the geometric and photometric consistency, whereas this consistency lays the foundation for jointly optimizing the camera pose and the map parameters. In this paper, we exploit the characteristics of continual learning and propose a novel SLAM framework for dynamic environments. While past efforts have been made to avoid catastrophic forgetting by exploiting an experience replay strategy, we view forgetting as a desirable characteristic. By adaptively controlling the replayed buffer, the ambiguity caused by moving objects can be easily alleviated through forgetting. We restrain the replay of the dynamic objects by introducing a continually-learned classifier for dynamic object identification. The iterative optimization of the neural map and the classifier notably improves the robustness of the SLAM system under a dynamic environment. Experiments on challenging datasets verify the effectiveness of the proposed framework.

CV Jan 14
GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials

Bei Huang, Yixin Chen, Ruijie Lu et al.

3D Gaussian Splatting (3DGS) has emerged as a prominent 3D representation for high-fidelity and real-time rendering. Prior work has coupled physics simulation with Gaussians, but predominantly targets soft, deformable materials, leaving brittle fracture largely unresolved. This stems from two key obstacles: the lack of volumetric interiors with coherent textures in the GS representation, and the absence of fracture-aware simulation methods for Gaussians. To address these challenges, we introduce GaussianFluent, a unified framework for realistic simulation and rendering of dynamic object states. First, it synthesizes photorealistic interiors by densifying internal Gaussians guided by generative models. Second, it integrates an optimized Continuum Damage Material Point Method (CD-MPM) to enable brittle fracture simulation at remarkably high speed. Our approach handles complex scenarios including mixed-material objects and multi-stage fracture propagation, achieving results infeasible with previous methods. Experiments clearly demonstrate GaussianFluent's capability for photo-realistic, real-time rendering with structurally consistent interiors, highlighting its potential for downstream applications such as VR and robotics.

CV Oct 13, 2023
Online Adaptive Disparity Estimation for Dynamic Scenes in Structured Light Systems

Rukun Qiao, Hiroshi Kawasaki, Hongbin Zha

In recent years, deep neural networks have shown remarkable progress in dense disparity estimation from dynamic scenes in monocular structured light systems. However, their performance significantly drops when applied in unseen environments. To address this issue, self-supervised online adaptation has been proposed as a solution to bridge this performance gap. Unlike traditional fine-tuning processes, online adaptation performs test-time optimization to adapt networks to new domains. Therefore, achieving fast convergence during the adaptation process is critical for attaining satisfactory accuracy. In this paper, we propose an unsupervised loss function based on long sequential inputs. It ensures better gradient directions and faster convergence. Our loss function is designed using a multi-frame pattern flow, which comprises a set of sparse trajectories of the projected pattern along the sequence. We estimate the sparse pseudo ground truth with a confidence mask using a filter-based method, which guides the online adaptation process. Our proposed framework significantly improves the online adaptation speed and achieves superior performance on unseen data.

CV May 20, 2024
Depth Reconstruction with Neural Signed Distance Fields in Structured Light Systems

Rukun Qiao, Hiroshi Kawasaki, Hongbin Zha

We introduce a novel depth estimation technique for multi-frame structured light setups using neural implicit representations of 3D space. Our approach employs a neural signed distance field (SDF), trained through self-supervised differentiable rendering. Unlike passive vision, where joint estimation of radiance and geometry fields is necessary, we capitalize on known radiance fields from projected patterns in structured light systems. This enables isolated optimization of the geometry field, ensuring convergence and network efficacy with fixed device positioning. To enhance geometric fidelity, we incorporate an additional color loss based on object surfaces during training. Real-world experiments demonstrate our method's superiority in geometric performance for few-shot scenarios, while achieving comparable results with increased pattern availability.

CV Apr 10
Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model

Shunkai Zhou, Zike Yan, Fei Xue et al.

We present Online3R, a new sequential reconstruction framework that is capable of adapting to new scenes through online learning, effectively resolving inconsistency issues. Specifically, we introduce a set of learnable lightweight visual prompts into a pretrained, frozen geometry foundation model to capture the knowledge of new environments while preserving the fundamental capability of the foundation model for geometry prediction. To solve the problems of missing groundtruth and the requirement of high efficiency when updating these visual prompts at test time, we introduce a local-global self-supervised learning strategy by enforcing the local and global consistency constraints on predictions. The local consistency constraints are conducted on intermediate and previously local fused results, enabling the model to be trained with high-quality pseudo groundtruth signals; the global consistency constraints are operated on sparse keyframes spanning long distances rather than per frame, allowing the model to learn from a consistent prediction over a long trajectory in an efficient way. Our experiments demonstrate that Online3R outperforms previous state-of-the-art methods on various benchmarks. Project page: https://shunkaizhou.github.io/online3r-1.0/

CV Oct 17, 2025
Proactive Scene Decomposition and Reconstruction

Baicheng Li, Zike Yan, Dong Wu et al.

Human behaviors are the major causes of scene dynamics and inherently contain rich cues regarding the dynamics. This paper formalizes a new task of proactive scene decomposition and reconstruction, an online approach that leverages human-object interactions to iteratively disassemble and reconstruct the environment. By observing these intentional interactions, we can dynamically refine the decomposition and reconstruction process, addressing inherent ambiguities in static object-level reconstruction. The proposed system effectively integrates multiple tasks in dynamic environments such as accurate camera and object pose estimation, instance decomposition, and online map updating, capitalizing on cues from human-object interactions in egocentric live streams for a flexible, progressive alternative to conventional object-level reconstruction methods. Aided by the Gaussian splatting technique, accurate and consistent dynamic scene modeling is achieved with photorealistic and efficient rendering. The efficacy is validated in multiple real-world scenarios with promising advantages.

CV Oct 1, 2025
Multi-level Dynamic Style Transfer for NeRFs

Zesheng Li, Shuaibo Li, Wei Ma et al.

As the application of neural radiance fields (NeRFs) in various 3D vision tasks continues to expand, numerous NeRF-based style transfer techniques have been developed. However, existing methods typically integrate style statistics into the original NeRF pipeline, often leading to suboptimal results in both content preservation and artistic stylization. In this paper, we present multi-level dynamic style transfer for NeRFs (MDS-NeRF), a novel approach that reengineers the NeRF pipeline specifically for stylization and incorporates an innovative dynamic style injection module. Particularly, we propose a multi-level feature adaptor that helps generate a multi-level feature grid representation from the content radiance field, effectively capturing the multi-scale spatial structure of the scene. In addition, we present a dynamic style injection module that learns to extract relevant style features and adaptively integrates them into the content patterns. The stylized multi-level features are then transformed into the final stylized view through our proposed multi-level cascade decoder. Furthermore, we extend our 3D style transfer method to support omni-view style transfer using 3D style references. Extensive experiments demonstrate that MDS-NeRF achieves outstanding performance for 3D style transfer, preserving multi-scale spatial structures while effectively transferring stylistic characteristics.

CV Sep 19, 2025
TrueMoE: Dual-Routing Mixture of Discriminative Experts for Synthetic Image Detection

Laixin Zhang, Shuaibo Li, Wei Ma et al.

The rapid progress of generative models has made synthetic image detection an increasingly critical task. Most existing approaches attempt to construct a single, universal discriminative space to separate real from fake content. However, such unified spaces tend to be complex and brittle, often struggling to generalize to unseen generative patterns. In this work, we propose TrueMoE, a novel dual-routing Mixture-of-Discriminative-Experts framework that reformulates the detection task as a collaborative inference across multiple specialized and lightweight discriminative subspaces. At the core of TrueMoE is a Discriminative Expert Array (DEA) organized along complementary axes of manifold structure and perceptual granularity, enabling diverse forgery cues to be captured across subspaces. A dual-routing mechanism, comprising a granularity-aware sparse router and a manifold-aware dense router, adaptively assigns input images to the most relevant experts. Extensive experiments across a wide spectrum of generative models demonstrate that TrueMoE achieves superior generalization and robustness.

CV Dec 21, 2021
Transferable End-to-end Room Layout Estimation via Implicit Encoding

Hao Zhao, Rene Ranftl, Yurong Chen et al.

We study the problem of estimating room layouts from a single panorama image. Most former works have two stages: feature extraction and parametric model fitting. Here we propose an end-to-end method that directly predicts parametric layouts from an input panorama image. It exploits an implicit encoding procedure that embeds parametric layouts into a latent space. Learning a mapping from images to this latent space then makes end-to-end room layout estimation possible. However, end-to-end methods have several notorious drawbacks despite many intriguing properties. A widely raised criticism is that they are troubled with dataset bias and do not transfer to unfamiliar domains. Our study echoes this common belief. To this end, we propose to use semantic boundary prediction maps as an intermediate domain. It brings a significant performance boost on four benchmarks (Structured3D, PanoContext, S3DIS, and Matterport3D), notably in the zero-shot transfer setting. Code, data, and models will be released.

CV Aug 12, 2021
Continual Neural Mapping: Learning An Implicit Scene Representation from Sequential Observations

Zike Yan, Yuxin Tian, Xuesong Shi et al.

Recent advances have enabled a single neural network to serve as an implicit scene representation, establishing the mapping function between spatial coordinates and scene properties. In this paper, we make a further step towards continual learning of the implicit scene representation directly from sequential observations, namely Continual Neural Mapping. The proposed problem setting bridges the gap between batch-trained implicit neural representations and commonly used streaming data in robotics and vision communities. We introduce an experience replay approach to tackle an exemplary task of continual neural mapping: approximating a continuous signed distance function (SDF) from sequential depth images as a scene geometry representation. We show for the first time that a single network can represent scene geometry over time continually without catastrophic forgetting, while achieving promising trade-offs between accuracy and efficiency.
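
To make the experience replay idea in the abstract above concrete, the sketch below mixes each new observation with samples drawn from a bounded buffer of past observations. The abstract does not describe how the buffer is maintained; reservoir sampling is swapped in here as a simple stand-in, and the class and method names are hypothetical.

```python
import random

class ReplayBuffer:
    """Bounded memory of past samples, kept uniform via reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of samples offered so far
        self.rng = random.Random(seed)

    def add(self, sample):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            # Keep the new sample with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = sample

    def training_batch(self, new_samples, replay_size):
        """New observations plus up to replay_size replayed old ones."""
        k = min(replay_size, len(self.items))
        return list(new_samples) + self.rng.sample(self.items, k)

# Example: stream ten "depth frames" (integers as placeholders).
buffer = ReplayBuffer(capacity=4)
for frame in range(10):
    batch = buffer.training_batch([frame], replay_size=2)
    # A network update on `batch` would go here.
    buffer.add(frame)
```

Training on `batch` rather than on the newest frame alone is what lets earlier parts of the scene keep contributing to the loss, which is the usual mechanism by which replay counters catastrophic forgetting.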

CV Mar 31, 2021
Online Learning of a Probabilistic and Adaptive Scene Representation

Zike Yan, Xin Wang, Hongbin Zha

Constructing and maintaining a consistent scene model on-the-fly is the core task for online spatial perception, interpretation, and action. In this paper, we represent the scene with a Bayesian nonparametric mixture model, seamlessly describing per-point occupancy status with a continuous probability density function. Instead of following the conventional data fusion paradigm, we address the problem of learning online the process by which sequential point cloud data are generated from the scene geometry. An incremental and parallel inference is performed to update the parameter space in real-time. We experimentally show that the proposed representation achieves state-of-the-art accuracy with promising efficiency. The consistent probabilistic formulation assures a generative model that is adaptive to different sensor characteristics, and the model complexity can be dynamically adjusted on-the-fly according to different data scales.

CV Mar 29, 2021
Generalizing to the Open World: Deep Visual Odometry with Online Adaptation

Shunkai Li, Xin Wu, Yingdian Cao et al.

Although learning-based visual odometry (VO) has shown impressive results in recent years, the pretrained networks may easily collapse in unseen environments. The large domain gap between training and testing data makes it difficult for them to generalize to new scenes. In this paper, we propose an online adaptation framework for deep VO with the assistance of scene-agnostic geometric computations and Bayesian inference. In contrast to learning-based pose estimation, our method solves pose from optical flow and depth, while the single-view depth estimation is continuously improved with new observations by online learned uncertainties. Meanwhile, an online learned photometric uncertainty is used for further depth and pose optimization by a differentiable Gauss-Newton layer. Our method enables fast adaptation of deep VO networks to unseen environments in a self-supervised manner. Extensive experiments including Cityscapes to KITTI and outdoor KITTI to indoor TUM demonstrate that our method achieves state-of-the-art generalization ability among self-supervised VO methods.

CV Sep 21, 2020
Line Flow based SLAM

Qiuyuan Wang, Zike Yan, Junqiu Wang et al.

We propose a visual SLAM method by predicting and updating line flows that represent sequential 2D projections of 3D line segments. While feature-based SLAM methods have achieved excellent results, they still face problems in challenging scenes containing occlusions, blurred images, and repetitive textures. To address these problems, we leverage a line flow to encode the coherence of line segment observations of the same 3D line along the temporal dimension, which has been neglected in prior SLAM systems. Thanks to this line flow representation, line segments in a new frame can be predicted according to their corresponding 3D lines and their predecessors along the temporal dimension. We create, update, merge, and discard line flows on-the-fly. We model the proposed line flow based SLAM (LF-SLAM) using a Bayesian network. Extensive experimental results demonstrate that the proposed LF-SLAM method achieves state-of-the-art results due to the utilization of line flows. Specifically, LF-SLAM obtains good localization and mapping results in challenging scenes with occlusions, blurred images, and repetitive textures.

RO Aug 2, 2020
Deep Visual Odometry with Adaptive Memory

Fei Xue, Xin Wang, Junqiu Wang et al.

We propose a novel deep visual odometry (VO) method that considers global information by selecting memory and refining poses. Existing learning-based methods take the VO task as a pure tracking problem via recovering camera poses from image snippets, leading to severe error accumulation. Global information is crucial for alleviating accumulated errors. However, it is challenging to effectively preserve such information for end-to-end systems. To deal with this challenge, we design an adaptive memory module, which progressively and adaptively saves the information from local to global in a neural analogue of memory, enabling our system to process long-term dependency. Benefiting from global information in the memory, previous results are further refined by an additional refining module. With the guidance of previous outputs, we adopt a spatial-temporal attention to select features for each view based on the co-visibility in feature domain. Specifically, our architecture consisting of Tracking, Remembering and Refining modules works beyond tracking. Experiments on the KITTI and TUM-RGBD datasets demonstrate that our approach outperforms state-of-the-art methods by large margins and produces competitive results against classic approaches in regular scenes. Moreover, our model achieves outstanding performance in challenging scenarios such as texture-less regions and abrupt motions, where classic algorithms tend to fail.

CV May 13, 2020
Self-Supervised Deep Visual Odometry with Online Adaptation

Shunkai Li, Xin Wang, Yingdian Cao et al.

Self-supervised VO methods have shown great success in jointly estimating camera pose and depth from videos. However, like most data-driven methods, existing VO networks suffer from a notable decrease in performance when confronted with scenes different from the training data, which makes them unsuitable for practical applications. In this paper, we propose an online meta-learning algorithm to enable VO networks to continuously adapt to new environments in a self-supervised manner. The proposed method utilizes convolutional long short-term memory (convLSTM) to aggregate rich spatial-temporal information in the past. The network is able to memorize and learn from its past experience for better estimation and fast adaptation to the current frame. When running VO in the open world, in order to deal with the changing environment, we propose an online feature alignment method that aligns feature distributions at different times. Our VO network is able to seamlessly adapt to different environments. Extensive experiments on unseen outdoor scenes, virtual to real world and outdoor to indoor environments demonstrate that our method consistently outperforms state-of-the-art self-supervised VO baselines by a considerable margin.

CV Aug 23, 2019
Sequential Adversarial Learning for Self-Supervised Deep Visual Odometry

Shunkai Li, Fei Xue, Xin Wang et al.

We propose a self-supervised learning framework for visual odometry (VO) that incorporates correlation of consecutive frames and takes advantage of adversarial learning. Previous methods tackle self-supervised VO as a local structure from motion (SfM) problem that recovers depth from a single image and relative poses from image pairs by minimizing photometric loss between warped and captured images. As single-view depth estimation is an ill-posed problem, and photometric loss is incapable of discriminating distortion artifacts of warped images, the estimated depth is vague and the pose is inaccurate. In contrast to previous methods, our framework learns a compact representation of frame-to-frame correlation, which is updated by incorporating sequential information. The updated representation is used for depth estimation. In addition, we tackle VO as a self-supervised image generation task and take advantage of Generative Adversarial Networks (GAN). The generator learns to estimate depth and pose to generate a warped target image. The discriminator evaluates the quality of the generated image with high-level structural perception that overcomes the problem of pixel-wise loss in previous methods. Experiments on the KITTI and Cityscapes datasets show that our method obtains more accurate depth with details preserved, and that the predicted pose significantly outperforms state-of-the-art self-supervised methods.

CVAug 6, 2019
Local Supports Global: Deep Camera Relocalization with Sequence Enhancement

Fei Xue, Xin Wang, Zike Yan et al.

We propose to leverage the local information in image sequences to support global camera relocalization. In contrast to previous methods that regress global poses from single images, we exploit the spatial-temporal consistency in sequential images to alleviate uncertainty due to visual ambiguities by incorporating a visual odometry (VO) component. Specifically, we introduce two effective steps called content-augmented pose estimation and motion-based refinement. The content-augmentation step focuses on alleviating the uncertainty of pose estimation by augmenting the observation based on the co-visibility in local maps built by the VO stream. Besides, the motion-based refinement is formulated as a pose graph, where the camera poses are further optimized by adopting relative poses provided by the VO component as additional motion constraints. Thus, the global consistency can be guaranteed. Experiments on the public indoor 7-Scenes and outdoor Oxford RobotCar benchmark datasets demonstrate that, by benefiting from the local information inherent in the sequence, our approach outperforms state-of-the-art methods, especially in some challenging cases, e.g., insufficient texture, highly repetitive textures, similar appearances, and over-exposure.

CVApr 15, 2019
Deep Comprehensive Correlation Mining for Image Clustering

Jianlong Wu, Keyu Long, Fei Wang et al.

Recently developed deep unsupervised methods allow us to jointly learn representations and cluster unlabelled data. These deep clustering methods mainly focus on the correlation among samples, e.g., selecting high-precision pairs to gradually tune the feature representation, which neglects other useful correlations. In this paper, we propose a novel clustering framework, named deep comprehensive correlation mining (DCCM), for exploring and taking full advantage of various kinds of correlations behind the unlabeled data from three aspects: 1) Instead of only using pair-wise information, pseudo-label supervision is proposed to investigate category information and learn discriminative features. 2) The features' robustness to image transformation of input space is fully explored, which benefits the network learning and significantly improves the performance. 3) The triplet mutual information among features is presented for the clustering problem to lift the recently discovered instance-level deep mutual information to a triplet-level formation, which further helps to learn more discriminative features. Extensive experiments on several challenging datasets show that our method achieves good performance, e.g., attaining $62.3\%$ clustering accuracy on CIFAR-10, which is $10.1\%$ higher than the state-of-the-art results.

CVApr 3, 2019
Beyond Tracking: Selecting Memory and Refining Poses for Deep Visual Odometry

Fei Xue, Xin Wang, Shunkai Li et al.

Most previous learning-based visual odometry (VO) methods treat VO as a pure tracking problem. In contrast, we present a VO framework that incorporates two additional components called Memory and Refining. The Memory component preserves global information by employing an adaptive and efficient selection strategy. The Refining component ameliorates previous results with the contexts stored in the Memory by adopting a spatial-temporal attention mechanism for feature distilling. Experiments on the KITTI and TUM-RGBD benchmark datasets demonstrate that our method outperforms state-of-the-art learning-based methods by a large margin and produces competitive results against classic monocular VO approaches. In particular, our model achieves outstanding performance in challenging scenarios such as texture-less regions and abrupt motions, where classic VO algorithms tend to fail.

CVNov 25, 2018
Guided Feature Selection for Deep Visual Odometry

Fei Xue, Qiuyuan Wang, Xin Wang et al.

We present a novel end-to-end visual odometry architecture with guided feature selection based on deep convolutional recurrent neural networks. Different from current monocular visual odometry methods, our approach is established on the intuition that features contribute discriminately to different motion patterns. Specifically, we propose a dual-branch recurrent network to learn the rotation and translation separately by leveraging a Convolutional Neural Network (CNN) for feature representation and a Recurrent Neural Network (RNN) for image sequence reasoning. To enhance the ability of feature selection, we further introduce an effective context-aware guidance mechanism to force each branch to explicitly distill the information related to its specific motion pattern. Experiments demonstrate that on the prevalent KITTI and ICL_NUIM benchmarks, our method outperforms current state-of-the-art model- and learning-based methods for both decoupled and joint camera pose recovery.

ROJul 29, 2018
PSDF Fusion: Probabilistic Signed Distance Function for On-the-fly 3D Data Fusion and Scene Reconstruction

Wei Dong, Qiuyuan Wang, Xin Wang et al.

We propose a novel 3D spatial representation for data fusion and scene reconstruction. Probabilistic Signed Distance Function (Probabilistic SDF, PSDF) is proposed to depict uncertainties in the 3D space. It is modeled by a joint distribution describing SDF value and its inlier probability, reflecting input data quality and surface geometry. A hybrid data structure involving voxel, surfel, and mesh is designed to fully exploit the advantages of various prevalent 3D representations. Connected by PSDF, these components reasonably cooperate in a consistent framework. Given sequential depth measurements, PSDF can be incrementally refined with less ad hoc parametric Bayesian updating. Supported by PSDF and the efficient 3D data representation, high-quality surfaces can be extracted on-the-fly, and in return contribute to reliable data fusion using the geometry information. Experiments demonstrate that our system reconstructs scenes with higher model quality and lower redundancy, and runs faster than existing online mesh generation systems.

CVJul 16, 2018
Recurrent Squeeze-and-Excitation Context Aggregation Net for Single Image Deraining

Xia Li, Jianlong Wu, Zhouchen Lin et al.

Rain streaks can severely degrade visibility, which causes many current computer vision algorithms to fail. It is therefore necessary to remove the rain from images. We propose a novel deep network architecture based on deep convolutional and recurrent neural networks for single image deraining. As contextual information is very important for rain removal, we first adopt the dilated convolutional neural network to acquire a large receptive field. To better fit the rain removal task, we also modify the network. In heavy rain, rain streaks have various directions and shapes, which can be regarded as the accumulation of multiple rain streak layers. We assign different alpha-values to various rain streak layers according to the intensity and transparency by incorporating the squeeze-and-excitation block. Since rain streak layers overlap with each other, it is not easy to remove the rain in one stage. So we further decompose the rain removal into multiple stages. A recurrent neural network is incorporated to preserve the useful information in previous stages and benefit the rain removal in later stages. We conduct extensive experiments on both synthetic and real-world datasets. Our proposed method outperforms the state-of-the-art approaches under all evaluation metrics. Codes and supplementary material are available at our project webpage: https://xialipku.github.io/RESCAN.

CVJul 10, 2018
Essential Tensor Learning for Multi-view Spectral Clustering

Jianlong Wu, Zhouchen Lin, Hongbin Zha

Multi-view clustering, which aims to take advantage of multi-view information to improve the performance of clustering, has attracted much attention recently. However, most recent work mainly focuses on self-representation based subspace clustering, which has high computational complexity. In this paper, we focus on the Markov chain based spectral clustering method and propose a novel essential tensor learning method to explore the high order correlations for multi-view representation. We first construct a tensor based on multi-view transition probability matrices of the Markov chain. By incorporating the idea from robust principal component analysis, a tensor singular value decomposition (t-SVD) based tensor nuclear norm is imposed to preserve the low-rank property of the essential tensor, which can well capture the principal information from multiple views. We also employ the tensor rotation operator for this task to better investigate the relationship among views as well as reduce the computational complexity. The proposed method can be efficiently optimized by the alternating direction method of multipliers (ADMM). Extensive experiments on six real world datasets corresponding to five different applications show that our method achieves superior performance over other state-of-the-art methods.

ROMar 11, 2018
An Efficient Volumetric Mesh Representation for Real-time Scene Reconstruction using Spatial Hashing

Wei Dong, Jieqi Shi, Weijie Tang et al.

Mesh plays an indispensable role in dense real-time reconstruction essential in robotics. Efforts have been made to maintain flexible data structures for 3D data fusion, yet an efficient incremental framework specifically designed for online mesh storage and manipulation is missing. We propose a novel framework to compactly generate, update, and refine mesh for scene reconstruction upon a volumetric representation. Maintaining a spatial-hashed field of cubes, we distribute vertices with continuous value on discrete edges that support O(1) vertex accessing and forbid memory redundancy. By introducing Hamming distance in mesh refinement, we further improve the mesh quality regarding the triangle type consistency with a low cost. Lock-based and lock-free operations were applied to avoid thread conflicts in GPU parallel computation. Experiments demonstrate that the mesh memory consumption is significantly reduced while the running speed is kept in the online reconstruction process.
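To make the "vertices on discrete edges" idea concrete, here is a minimal CPU-side sketch, not the paper's GPU implementation. It assumes that each grid edge is owned by the cube at its lower corner, so neighbouring cubes that share an edge resolve to the same slot in a hash map. The class `EdgeVertexGrid`, its method names, and the `(axis, a, b)` edge addressing are hypothetical names introduced only for this illustration; a Python `dict` stands in for the spatial hash.

```python
class EdgeVertexGrid:
    """Hash map from integer cube corners to the 3 edges that corner owns."""

    def __init__(self, voxel_size):
        self.voxel_size = voxel_size
        self.cubes = {}  # (i, j, k) -> [t_x, t_y, t_z], each None or in [0, 1]

    @staticmethod
    def owner_of(cube, axis, a, b):
        """Map one of a cube's 12 edges to its owning (corner, axis) slot.

        The edge runs along `axis`; `a` and `b` (0 or 1) are the offsets
        of the edge along the two remaining axes, in increasing axis order.
        """
        offset = [0, 0, 0]
        others = [d for d in range(3) if d != axis]
        offset[others[0]] = a
        offset[others[1]] = b
        return tuple(c + o for c, o in zip(cube, offset)), axis

    def set_vertex(self, cube, axis, a, b, t):
        """Store a vertex at fraction t along the edge (average O(1))."""
        corner, axis = self.owner_of(cube, axis, a, b)
        self.cubes.setdefault(corner, [None, None, None])[axis] = t

    def get_vertex(self, cube, axis, a, b):
        """Return the vertex position in world units, or None if unset."""
        corner, axis = self.owner_of(cube, axis, a, b)
        slots = self.cubes.get(corner)
        if slots is None or slots[axis] is None:
            return None
        pos = [c * self.voxel_size for c in corner]
        pos[axis] += slots[axis] * self.voxel_size
        return tuple(pos)
```

Because both cubes that touch an edge compute the same owner key, a vertex written through one cube is read back through its neighbour without being duplicated, which is the memory-redundancy point the abstract makes.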

LGFeb 1, 2018
Alternating Multi-bit Quantization for Recurrent Neural Networks

Chen Xu, Jianqiang Yao, Zhouchen Lin et al.

Recurrent neural networks have achieved excellent performance in many applications. However, on portable devices with limited resources, the models are often too large to deploy. For applications on the server with large scale concurrent requests, the latency during inference can also be very critical for costly computing resources. In this work, we address these problems by quantizing the network, both weights and activations, into multiple binary codes {-1,+1}. We formulate the quantization as an optimization problem. Under the key observation that once the quantization coefficients are fixed the binary codes can be derived efficiently by a binary search tree, alternating minimization is then applied. We test the quantization for two well-known RNNs, i.e., long short-term memory (LSTM) and gated recurrent unit (GRU), on the language models. Compared with the full-precision counterpart, by 2-bit quantization we can achieve ~16x memory saving and ~6x real inference acceleration on CPUs, with only a reasonable loss in the accuracy. By 3-bit quantization, we can achieve almost no loss in the accuracy or even surpass the original model, with ~10.5x memory saving and ~3x real inference acceleration. Both results beat the existing quantization works by large margins. We extend our alternating quantization to image classification tasks. In both RNNs and feedforward neural networks, the method also achieves excellent performance.
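The alternating scheme described above can be sketched for the 2-bit case, approximating a weight vector as `w ≈ a1*b1 + a2*b2` with `b1`, `b2` in {-1,+1}. This is an illustrative pure-Python sketch under stated assumptions, not the authors' code: the greedy residual initialization, the function names, and the use of `bisect` over the four sorted levels (standing in for the binary search tree mentioned in the abstract) are all choices made here for brevity.

```python
from bisect import bisect_left


def loss(w, alphas, b1, b2):
    """Squared error of the 2-bit approximation w ~ a1*b1 + a2*b2."""
    a1, a2 = alphas
    return sum((v - a1 * x - a2 * y) ** 2 for v, x, y in zip(w, b1, b2))


def greedy_init(w):
    """Assumed starting point: quantize w to one bit, then its residual."""
    a1 = sum(abs(v) for v in w) / len(w)
    b1 = [1 if v >= 0 else -1 for v in w]
    r = [v - a1 * x for v, x in zip(w, b1)]
    a2 = sum(abs(v) for v in r) / len(r)
    b2 = [1 if v >= 0 else -1 for v in r]
    return (a1, a2), b1, b2


def refit_alphas(w, b1, b2, previous):
    """Codes fixed: least-squares coefficients via 2x2 normal equations."""
    n = len(w)
    c = sum(x * y for x, y in zip(b1, b2))
    p1 = sum(x * v for x, v in zip(b1, w))
    p2 = sum(y * v for y, v in zip(b2, w))
    det = n * n - c * c
    if det == 0:  # b1 == +/- b2: singular system, keep old coefficients
        return previous
    return ((n * p1 - c * p2) / det, (n * p2 - c * p1) / det)


def assign_codes(w, alphas):
    """Coefficients fixed: snap each value to the nearest of the 4 levels."""
    a1, a2 = alphas
    levels = sorted((s1 * a1 + s2 * a2, s1, s2)
                    for s1 in (-1, 1) for s2 in (-1, 1))
    values = [lv[0] for lv in levels]
    b1, b2 = [], []
    for v in w:
        i = bisect_left(values, v)  # binary search over the sorted levels
        cands = [j for j in (i - 1, i) if 0 <= j < len(levels)]
        j = min(cands, key=lambda k: abs(values[k] - v))
        b1.append(levels[j][1])
        b2.append(levels[j][2])
    return b1, b2


def quantize_2bit(w, iters=5):
    """Alternate between refitting coefficients and reassigning codes."""
    alphas, b1, b2 = greedy_init(w)
    for _ in range(iters):
        alphas = refit_alphas(w, b1, b2, alphas)
        b1, b2 = assign_codes(w, alphas)
    return alphas, b1, b2
```

Each half-step minimizes the squared error with the other block held fixed, so the error never increases relative to the greedy start. For example, `quantize_2bit([0.9, -0.4, 0.1, -1.3])` returns the two coefficients and the two code vectors.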

MLNov 25, 2016
A Unified Convex Surrogate for the Schatten-$p$ Norm

Chen Xu, Zhouchen Lin, Hongbin Zha

The Schatten-$p$ norm ($0<p<1$) has been widely used to replace the nuclear norm for better approximating the rank function. However, existing methods are either 1) not scalable for large scale problems due to relying on singular value decomposition (SVD) in every iteration, or 2) specific to some $p$ values, e.g., $1/2$, and $2/3$. In this paper, we show that for any $p$, $p_1$, and $p_2 >0$ satisfying $1/p=1/p_1+1/p_2$, there is an equivalence between the Schatten-$p$ norm of one matrix and the Schatten-$p_1$ and the Schatten-$p_2$ norms of its two factor matrices. We further extend the equivalence to multiple factor matrices and show that all the factor norms can be convex and smooth for any $p>0$. In contrast, the original Schatten-$p$ norm for $0<p<1$ is non-convex and non-smooth. As an example we conduct experiments on matrix completion. To utilize the convexity of the factor matrix norms, we adopt the accelerated proximal alternating linearized minimization algorithm and establish its sequence convergence. Experiments on both synthetic and real datasets exhibit its superior performance over the state-of-the-art methods. Its speed is also highly competitive.
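For orientation, the best-known special case of this kind of factor equivalence is the nuclear norm ($p=1$, $p_1=p_2=2$), which is a standard result:

$$\|X\|_{*} \;=\; \min_{U,V:\;X=UV^{\top}} \tfrac{1}{2}\left(\|U\|_F^{2} + \|V\|_F^{2}\right).$$

Under the condition $1/p = 1/p_1 + 1/p_2$, the general statement would plausibly take the following form. The exact normalization is an assumption made here for illustration and is not quoted from the paper, although it does reduce to the nuclear-norm identity above when $p=1$ and $p_1=p_2=2$:

$$\tfrac{1}{p}\,\|X\|_{S_p}^{p} \;=\; \min_{U,V:\;X=UV^{\top}} \tfrac{1}{p_1}\,\|U\|_{S_{p_1}}^{p_1} + \tfrac{1}{p_2}\,\|V\|_{S_{p_2}}^{p_2}.$$

The practical appeal is that the factor norms on the right can be chosen with $p_1, p_2 \ge 1$, which are convex, even when $0<p<1$ makes the left-hand side non-convex.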

OCNov 25, 2015
Relaxed Majorization-Minimization for Non-smooth and Non-convex Optimization

Chen Xu, Zhouchen Lin, Zhenyu Zhao et al.

We propose a new majorization-minimization (MM) method for non-smooth and non-convex programs, which is general enough to include the existing MM methods. Besides the local majorization condition, we only require that the difference between the directional derivatives of the objective function and its surrogate function vanishes when the number of iterations approaches infinity, which is a very weak condition. So our method can use a surrogate function that directly approximates the non-smooth objective function. In comparison, all the existing MM methods construct the surrogate function by approximating the smooth component of the objective function. We apply our relaxed MM methods to the robust matrix factorization (RMF) problem with different regularizations, where our locally majorant algorithm shows advantages over the state-of-the-art approaches for RMF. This is the first algorithm for RMF ensuring, without extra assumptions, that any limit point of the iterates is a stationary point.

LGJun 18, 2012
Total Variation and Euler's Elastica for Supervised Learning

Tong Lin, Hanlin Xue, Ling Wang et al.

In recent years, total variation (TV) and Euler's elastica (EE) have been successfully applied to image processing tasks such as denoising and inpainting. This paper investigates how to extend TV and EE to the supervised learning setting on high dimensional data. The supervised learning problem can be formulated as an energy functional minimization under the Tikhonov regularization scheme, where the energy is composed of a squared loss and a total variation smoothing (or Euler's elastica smoothing) term. Its solution via variational principles leads to an Euler-Lagrange PDE. However, the PDE is always high-dimensional and cannot be directly solved by common methods. Instead, radial basis functions are utilized to approximate the target function, reducing the problem to finding the linear coefficients of the basis functions. We apply the proposed methods to supervised learning tasks (including binary classification, multi-class classification, and regression) on benchmark data sets. Extensive experiments have demonstrated promising results of the proposed methods.