CVSep 25, 2023Code
IEBins: Iterative Elastic Bins for Monocular Depth EstimationShuwei Shao, Zhongcai Pei, Xingming Wu et al.
Monocular depth estimation (MDE) is a fundamental topic of geometric computer vision and a core technique for many downstream applications. Recently, several methods reframe the MDE as a classification-regression problem where a linear combination of probabilistic distribution and bin centers is used to predict depth. In this paper, we propose a novel concept of iterative elastic bins (IEBins) for the classification-regression-based MDE. The proposed IEBins aims to search for high-quality depth by progressively optimizing the search range, which involves multiple stages and each stage performs a finer-grained depth search in the target bin on top of its previous stage. To alleviate the possible error accumulation during the iterative process, we utilize a novel elastic target bin to replace the original target bin, the width of which is adjusted elastically based on the depth uncertainty. Furthermore, we develop a dedicated framework composed of a feature extractor and an iterative optimizer that has powerful temporal context modeling capabilities benefiting from the GRU-based architecture. Extensive experiments on the KITTI, NYU-Depth-v2 and SUN RGB-D datasets demonstrate that the proposed method surpasses prior state-of-the-art competitors. The source code is publicly available at https://github.com/ShuweiShao/IEBins.
CVJun 21, 2023Code
Online Unsupervised Video Object Segmentation via Contrastive Motion ClusteringLin Xi, Weihai Chen, Xingming Wu et al.
Online unsupervised video object segmentation (UVOS) uses the previous frames as its input to automatically separate the primary object(s) from a streaming video without using any further manual annotation. A major challenge is that the model has no access to the future and must rely solely on the history, i.e., the segmentation mask is predicted from the current frame as soon as it is captured. In this work, a novel contrastive motion clustering algorithm with an optical flow as its input is proposed for the online UVOS by exploiting the common fate principle that visual elements tend to be perceived as a group if they possess the same motion pattern. We build a simple and effective auto-encoder to iteratively summarize non-learnable prototypical bases for the motion pattern, while the bases in turn help learn the representation of the embedding network. Further, a contrastive learning strategy based on a boundary prior is developed to improve foreground and background feature discrimination in the representation learning stage. The proposed algorithm can be optimized on arbitrarily-scale data i.e., frame, clip, dataset) and performed in an online fashion. Experiments on $\textit{DAVIS}_{\textit{16}}$, $\textit{FBMS}$, and $\textit{SegTrackV2}$ datasets show that the accuracy of our method surpasses the previous state-of-the-art (SoTA) online UVOS method by a margin of 0.8%, 2.9%, and 1.1%, respectively. Furthermore, by using an online deep subspace clustering to tackle the motion grouping, our method is able to achieve higher accuracy at $3\times$ faster inference time compared to SoTA online UVOS method, and making a good trade-off between effectiveness and efficiency. Our code is available at https://github.com/xilin1991/ClusterNet.
CVApr 7, 2023
ALIKED: A Lighter Keypoint and Descriptor Extraction Network via Deformable TransformationXiaoming Zhao, Xingming Wu, Weihai Chen et al.
Image keypoints and descriptors play a crucial role in many visual measurement tasks. In recent years, deep neural networks have been widely used to improve the performance of keypoint and descriptor extraction. However, the conventional convolution operations do not provide the geometric invariance required for the descriptor. To address this issue, we propose the Sparse Deformable Descriptor Head (SDDH), which learns the deformable positions of supporting features for each keypoint and constructs deformable descriptors. Furthermore, SDDH extracts descriptors at sparse keypoints instead of a dense descriptor map, which enables efficient extraction of descriptors with strong expressiveness. In addition, we relax the neural reprojection error (NRE) loss from dense to sparse to train the extracted sparse descriptors. Experimental results show that the proposed network is both efficient and powerful in various visual measurement tasks, including image matching, 3D reconstruction, and visual relocalization.
CVMar 16, 2023
Image Enhancement for Remote Photoplethysmography in a Low-Light EnvironmentLin Xi, Weihai Chen, Changchen Zhao et al.
With the improvement of sensor technology and significant algorithmic advances, the accuracy of remote heart rate monitoring technology has been significantly improved. Despite of the significant algorithmic advances, the performance of rPPG algorithm can degrade in the long-term, high-intensity continuous work occurred in evenings or insufficient light environments. One of the main challenges is that the lost facial details and low contrast cause the failure of detection and tracking. Also, insufficient lighting in video capturing hurts the quality of physiological signal. In this paper, we collect a large-scale dataset that was designed for remote heart rate estimation recorded with various illumination variations to evaluate the performance of the rPPG algorithm (Green, ICA, and POS). We also propose a low-light enhancement solution (technical solution) for remote heart rate estimation under the low-light condition. Using collected dataset, we found 1) face detection algorithm cannot detect faces in video captured in low light conditions; 2) A decrease in the amplitude of the pulsatile signal will lead to the noise signal to be in the dominant position; and 3) the chrominance-based method suffers from the limitation in the assumption about skin-tone will not hold, and Green and ICA method receive less influence than POS in dark illuminance environment. The proposed solution for rPPG process is effective to detect and improve the signal-to-noise ratio and precision of the pulsatile signal.
CVSep 19, 2023
NDDepth: Normal-Distance Assisted Monocular Depth EstimationShuwei Shao, Zhongcai Pei, Weihai Chen et al.
Monocular depth estimation has drawn widespread attention from the vision community due to its broad applications. In this paper, we propose a novel physics (geometry)-driven deep learning framework for monocular depth estimation by assuming that 3D scenes are constituted by piece-wise planes. Particularly, we introduce a new normal-distance head that outputs pixel-level surface normal and plane-to-origin distance for deriving depth at each position. Meanwhile, the normal and distance are regularized by a developed plane-aware consistency constraint. We further integrate an additional depth head to improve the robustness of the proposed framework. To fully exploit the strengths of these two heads, we develop an effective contrastive iterative refinement module that refines depth in a complementary manner according to the depth uncertainty. Extensive experiments indicate that the proposed method exceeds previous state-of-the-art competitors on the NYU-Depth-v2, KITTI and SUN RGB-D datasets. Notably, it ranks 1st among all submissions on the KITTI depth prediction online benchmark at the submission time.
CVFeb 20, 2023
Self-Supervised Monocular Depth Estimation with Self-Reference Distillation and Disparity Offset RefinementZhong Liu, Ran Li, Shuwei Shao et al.
Monocular depth estimation plays a fundamental role in computer vision. Due to the costly acquisition of depth ground truth, self-supervised methods that leverage adjacent frames to establish a supervisory signal have emerged as the most promising paradigms. In this work, we propose two novel ideas to improve self-supervised monocular depth estimation: 1) self-reference distillation and 2) disparity offset refinement. Specifically, we use a parameter-optimized model as the teacher updated as the training epochs to provide additional supervision during the training process. The teacher model has the same structure as the student model, with weights inherited from the historical student model. In addition, a multiview check is introduced to filter out the outliers produced by the teacher model. Furthermore, we leverage the contextual consistency between high-scale and low-scale features to obtain multiscale disparity offsets, which are used to refine the disparity output incrementally by aligning disparity information at different scales. The experimental results on the KITTI and Make3D datasets show that our method outperforms previous state-of-the-art competitors.
CVApr 6, 2022
Implicit Motion-Compensated Network for Unsupervised Video Object SegmentationLin Xi, Weihai Chen, Xingming Wu et al.
Unsupervised video object segmentation (UVOS) aims at automatically separating the primary foreground object(s) from the background in a video sequence. Existing UVOS methods either lack robustness when there are visually similar surroundings (appearance-based) or suffer from deterioration in the quality of their predictions because of dynamic background and inaccurate flow (flow-based). To overcome the limitations, we propose an implicit motion-compensated network (IMCNet) combining complementary cues ($\textit{i.e.}$, appearance and motion) with aligned motion information from the adjacent frames to the current frame at the feature level without estimating optical flows. The proposed IMCNet consists of an affinity computing module (ACM), an attention propagation module (APM), and a motion compensation module (MCM). The light-weight ACM extracts commonality between neighboring input frames based on appearance features. The APM then transmits global correlation in a top-down manner. Through coarse-to-fine iterative inspiring, the APM will refine object regions from multiple resolutions so as to efficiently avoid losing details. Finally, the MCM aligns motion information from temporally adjacent frames to the current frame which achieves implicit motion compensation at the feature level. We perform extensive experiments on $\textit{DAVIS}_{\textit{16}}$ and $\textit{YouTube-Objects}$. Our network achieves favorable performance while running at a faster speed compared to the state-of-the-art methods.
CVMay 30, 2022
Learnable Patchmatch and Self-Teaching for Multi-Frame Depth Estimation in Monocular EndoscopyShuwei Shao, Zhongcai Pei, Weihai Chen et al.
This work delves into unsupervised monocular depth estimation in endoscopy, which leverages adjacent frames to establish a supervisory signal during the training phase. For many clinical applications, e.g., surgical navigation, temporally correlated frames are also available at test time. Due to the lack of depth clues, making full use of the temporal correlation among multiple video frames at both phases is crucial for accurate depth estimation. However, several challenges in endoscopic scenes, such as low and homogeneous textures and inter-frame brightness fluctuations, limit the performance gain from the temporal correlation. To fully exploit it, we propose a novel unsupervised multi-frame monocular depth estimation model. The proposed model integrates a learnable patchmatch module to adaptively increase the discriminative ability in regions with low and homogeneous textures, and enforces cross-teaching and self-teaching consistencies to provide efficacious regularizations towards brightness fluctuations. Furthermore, as a byproduct of the self-teaching paradigm, the proposed model is able to improve the depth predictions when more frames are input at test time. We conduct detailed experiments on multiple datasets, including SCARED, EndoSLAM, Hamlyn and SERV-CT. The experimental results indicate that our model exceeds the state-of-the-art competitors. The source code and trained models will be publicly available upon the acceptance.
CVNov 20, 2022
Real-time Local Feature with Global Visual Information EnhancementJinyu Miao, Haosong Yue, Zhong Liu et al.
Local feature provides compact and invariant image representation for various visual tasks. Current deep learning-based local feature algorithms always utilize convolution neural network (CNN) architecture with limited receptive field. Besides, even with high-performance GPU devices, the computational efficiency of local features cannot be satisfactory. In this paper, we tackle such problems by proposing a CNN-based local feature algorithm. The proposed method introduces a global enhancement module to fuse global visual clues in a light-weight network, and then optimizes the network by novel deep reinforcement learning scheme from the perspective of local feature matching task. Experiments on the public benchmarks demonstrate that the proposal can achieve considerable robustness against visual interference and meanwhile run in real time.
CVDec 15, 2021Code
Self-Supervised Monocular Depth and Ego-Motion Estimation in Endoscopy: Appearance Flow to the RescueShuwei Shao, Zhongcai Pei, Weihai Chen et al.
Recently, self-supervised learning technology has been applied to calculate depth and ego-motion from monocular videos, achieving remarkable performance in autonomous driving scenarios. One widely adopted assumption of depth and ego-motion self-supervised learning is that the image brightness remains constant within nearby frames. Unfortunately, the endoscopic scene does not meet this assumption because there are severe brightness fluctuations induced by illumination variations, non-Lambertian reflections and interreflections during data collection, and these brightness fluctuations inevitably deteriorate the depth and ego-motion estimation accuracy. In this work, we introduce a novel concept referred to as appearance flow to address the brightness inconsistency problem. The appearance flow takes into consideration any variations in the brightness pattern and enables us to develop a generalized dynamic image constraint. Furthermore, we build a unified self-supervised framework to estimate monocular depth and ego-motion simultaneously in endoscopic scenes, which comprises a structure module, a motion module, an appearance module and a correspondence module, to accurately reconstruct the appearance and calibrate the image brightness. Extensive experiments are conducted on the SCARED dataset and EndoSLAM dataset, and the proposed unified framework exceeds other self-supervised approaches by a large margin. To validate our framework's generalization ability on different patients and cameras, we train our model on SCARED but test it on the SERV-CT and Hamlyn datasets without any fine-tuning, and the superior results reveal its strong generalization ability. Code will be available at: \url{https://github.com/ShuweiShao/AF-SfMLearner}.
CVNov 20, 2021Code
FAMINet: Learning Real-time Semi-supervised Video Object Segmentation with Steepest Optimized Optical FlowZiyang Liu, Jingmeng Liu, Weihai Chen et al.
Semi-supervised video object segmentation (VOS) aims to segment a few moving objects in a video sequence, where these objects are specified by annotation of first frame. The optical flow has been considered in many existing semi-supervised VOS methods to improve the segmentation accuracy. However, the optical flow-based semi-supervised VOS methods cannot run in real time due to high complexity of optical flow estimation. A FAMINet, which consists of a feature extraction network (F), an appearance network (A), a motion network (M), and an integration network (I), is proposed in this study to address the abovementioned problem. The appearance network outputs an initial segmentation result based on static appearances of objects. The motion network estimates the optical flow via very few parameters, which are optimized rapidly by an online memorizing algorithm named relaxed steepest descent. The integration network refines the initial segmentation result using the optical flow. Extensive experiments demonstrate that the FAMINet outperforms other state-of-the-art semi-supervised VOS methods on the DAVIS and YouTube-VOS benchmarks, and it achieves a good trade-off between accuracy and efficiency. Our code is available at https://github.com/liuziyang123/FAMINet.
CVDec 31, 2021
Sparse LiDAR Assisted Self-supervised Stereo Disparity EstimationXiaoming Zhao, Weihai Chen, Xingming Wu et al.
Deep stereo matching has made significant progress in recent years. However, state-of-the-art methods are based on expensive 4D cost volume, which limits their use in real-world applications. To address this issue, 3D correlation maps and iterative disparity updates have been proposed. Regarding that in real-world platforms, such as self-driving cars and robots, the Lidar is usually installed. Thus we further introduce the sparse Lidar point into the iterative updates, which alleviates the burden of network updating the disparity from zero states. Furthermore, we propose training the network in a self-supervised way so that it can be trained on any captured data for better generalization ability. Experiments and comparisons show that the presented method is effective and achieves comparable results with related methods.
IVDec 25, 2021
DSRGAN: Detail Prior-Assisted Perceptual Single Image Super-Resolution via Generative Adversarial NetworksZiyang Liu, Zhengguo Li, Xingming Wu et al.
The generative adversarial network (GAN) is successfully applied to study the perceptual single image superresolution (SISR). However, the GAN often tends to generate images with high frequency details being inconsistent with the real ones. Inspired by conventional detail enhancement algorithms, we propose a novel prior knowledge, the detail prior, to assist the GAN in alleviating this problem and restoring more realistic details. The proposed method, named DSRGAN, includes a well designed detail extraction algorithm to capture the most important high frequency information from images. Then, two discriminators are utilized for supervision on image-domain and detail-domain restorations, respectively. The DSRGAN merges the restored detail into the final output via a detail enhancement manner. The special design of DSRGAN takes advantages from both the model-based conventional algorithm and the data-driven deep learning network. Experimental results demonstrate that the DSRGAN outperforms the state-of-the-art SISR methods on perceptual metrics and achieves comparable results in terms of fidelity metrics simultaneously. Following the DSRGAN, it is feasible to incorporate other conventional image processing algorithms into a deep learning network to form a model-based deep SISR.
CVDec 6, 2021
ALIKE: Accurate and Lightweight Keypoint Detection and Descriptor ExtractionXiaoming Zhao, Xingming Wu, Jinyu Miao et al.
Existing methods detect the keypoints in a non-differentiable way, therefore they can not directly optimize the position of keypoints through back-propagation. To address this issue, we present a partially differentiable keypoint detection module, which outputs accurate sub-pixel keypoints. The reprojection loss is then proposed to directly optimize these sub-pixel keypoints, and the dispersity peak loss is presented for accurate keypoints regularization. We also extract the descriptors in a sub-pixel way, and they are trained with the stable neural reprojection error loss. Moreover, a lightweight network is designed for keypoint detection and descriptor extraction, which can run at 95 frames per second for 640x480 images on a commercial GPU. On homography estimation, camera pose estimation, and visual (re-)localization tasks, the proposed method achieves equivalent performance with the state-of-the-art approaches, while greatly reduces the inference time.
CVNov 17, 2021
Probabilistic Spatial Distribution Prior Based Attentional Keypoints Matching NetworkXiaoming Zhao, Jingmeng Liu, Xingming Wu et al.
Keypoints matching is a pivotal component for many image-relevant applications such as image stitching, visual simultaneous localization and mapping (SLAM), and so on. Both handcrafted-based and recently emerged deep learning-based keypoints matching methods merely rely on keypoints and local features, while losing sight of other available sensors such as inertial measurement unit (IMU) in the above applications. In this paper, we demonstrate that the motion estimation from IMU integration can be used to exploit the spatial distribution prior of keypoints between images. To this end, a probabilistic perspective of attention formulation is proposed to integrate the spatial distribution prior into the attentional graph neural network naturally. With the assistance of spatial distribution prior, the effort of the network for modeling the hidden features can be reduced. Furthermore, we present a projection loss for the proposed keypoints matching network, which gives a smooth edge between matching and un-matching keypoints. Image matching experiments on visual SLAM datasets indicate the effectiveness and efficiency of the presented method.
CVNov 16, 2021
Towards Comprehensive Monocular Depth Estimation: Multiple Heads Are Better Than OneShuwei Shao, Ran Li, Zhongcai Pei et al.
Depth estimation attracts widespread attention in the computer vision community. However, it is still quite difficult to recover an accurate depth map using only one RGB image. We observe a phenomenon that existing methods tend to fail in different cases, caused by differences in network architecture, loss function and so on. In this work, we investigate into the phenomenon and propose to integrate the strengths of multiple weak depth predictor to build a comprehensive and accurate depth predictor, which is critical for many real-world applications, e.g., 3D reconstruction. Specifically, we construct multiple base (weak) depth predictors by utilizing different Transformer-based and convolutional neural network (CNN)-based architectures. Transformer establishes long-range correlation while CNN preserves local information ignored by Transformer due to the spatial inductive bias. Therefore, the coupling of Transformer and CNN contributes to the generation of complementary depth estimates, which are essential to achieve a comprehensive depth predictor. Then, we design mixers to learn from multiple weak predictions and adaptively fuse them into a strong depth estimate. The resultant model, which we refer to as Transformer-assisted depth ensembles (TEDepth). On the standard NYU-Depth-v2 and KITTI datasets, we thoroughly explore how the neural ensembles affect the depth estimation and demonstrate that our TEDepth achieves better results than previous state-of-the-art approaches. To validate the generalizability across cameras, we directly apply the models trained on NYU-Depth-v2 to the SUN RGB-D dataset without any fine-tuning, and the superior results emphasize its strong generalizability.
IVNov 14, 2021
Deep Joint Demosaicing and High Dynamic Range Imaging within a Single ShotYilun Xu, Ziyang Liu, Xingming Wu et al.
Spatially varying exposure (SVE) is a promising choice for high-dynamic-range (HDR) imaging (HDRI). The SVE-based HDRI, which is called single-shot HDRI, is an efficient solution to avoid ghosting artifacts. However, it is very challenging to restore a full-resolution HDR image from a real-world image with SVE because: a) only one-third of pixels with varying exposures are captured by camera in a Bayer pattern, b) some of the captured pixels are over- and under-exposed. For the former challenge, a spatially varying convolution (SVC) is designed to process the Bayer images carried with varying exposures. For the latter one, an exposure-guidance method is proposed against the interference from over- and under-exposed pixels. Finally, a joint demosaicing and HDRI deep learning framework is formalized to include the two novel components and to realize an end-to-end single-shot HDRI. Experiments indicate that the proposed end-to-end framework avoids the problem of cumulative errors and surpasses the related state-of-the-art methods.
CVMar 18, 2021
Discriminative and Semantic Feature Selection for Place Recognition towards Dynamic EnvironmentsYuxin Tian, Jinyu MIao, Xingming Wu et al.
Features play an important role in various visual tasks, especially in visual place recognition applied in perceptual changing environments. In this paper, we address the challenges of place recognition due to dynamics and confusable patterns by proposing a discriminative and semantic feature selection network, dubbed as DSFeat. Supervised by both semantic information and attention mechanism, we can estimate pixel-wise stability of features, indicating the probability of a static and stable region from which features are extracted, and then select features that are insensitive to dynamic interference and distinguishable to be correctly matched. The designed feature selection model is evaluated in place recognition and SLAM system in several public datasets with varying appearances and viewpoints. Experimental results conclude that the effectiveness of the proposed method. It should be noticed that our proposal can be readily pluggable into any feature-based SLAM system.
IVJul 13, 2019
S&CNet: Monocular Depth Completion for Autonomous Systems and 3D ReconstructionLei Zhang, Weihai Chen, Chao Hu et al.
Dense depth completion is essential for autonomous systems and 3D reconstruction. In this paper, a lightweight yet efficient network (S\&CNet) is proposed to obtain a good trade-off between efficiency and accuracy for the dense depth completion. A dual-stream attention module (S\&C enhancer) is introduced to measure both spatial-wise and the channel-wise global-range relationship of extracted features so as to improve the performance. A coarse-to-fine network is designed and the proposed S\&C enhancer is plugged into the coarse estimation network between its encoder and decoder network. Experimental results demonstrate that our approach achieves competitive performance with existing works on KITTI dataset but almost four times faster. The proposed S\&C enhancer can be plugged into other existing works and boost their performance significantly with a negligible additional computational cost.