CVFeb 16, 2023Code
Parallax-Tolerant Unsupervised Deep Image StitchingLang Nie, Chunyu Lin, Kang Liao et al.
Traditional image stitching approaches tend to leverage increasingly complex geometric features (point, line, edge, etc.) for better performance. However, these hand-crafted features are only suitable for specific natural scenes with adequate geometric structures. In contrast, deep stitching schemes overcome the adverse conditions by adaptively learning robust semantic features, but they cannot handle large-parallax cases due to homography-based registration. To solve these issues, we propose UDIS++, a parallax-tolerant unsupervised deep image stitching technique. First, we propose a robust and flexible warp to model the image registration from global homography to local thin-plate spline motion. It provides accurate alignment for overlapping regions and shape preservation for non-overlapping regions by joint optimization concerning alignment and distortion. Subsequently, to improve the generalization capability, we design a simple but effective iterative strategy to enhance the warp adaption in cross-dataset and cross-resolution applications. Finally, to further eliminate the parallax artifacts, we propose to composite the stitched image seamlessly by unsupervised learning for seam-driven composition masks. Compared with existing methods, our solution is parallax-tolerant and free from laborious designs of complicated geometric features for specific scenes. Extensive experiments show our superiority over the SoTA methods, both quantitatively and qualitatively. The code is available at https://github.com/nie-lang/UDIS2.
CVSep 28, 2023Code
GAFlow: Incorporating Gaussian Attention into Optical FlowAo Luo, Fan Yang, Xin Li et al.
Optical flow, or the estimation of motion fields from image sequences, is one of the fundamental problems in computer vision. Unlike most pixel-wise tasks that aim at achieving consistent representations of the same category, optical flow raises extra demands for obtaining local discrimination and smoothness, which yet is not fully explored by existing approaches. In this paper, we push Gaussian Attention (GA) into the optical flow models to accentuate local properties during representation learning and enforce the motion affinity during matching. Specifically, we introduce a novel Gaussian-Constrained Layer (GCL) which can be easily plugged into existing Transformer blocks to highlight the local neighborhood that contains fine-grained structural information. Moreover, for reliable motion analysis, we provide a new Gaussian-Guided Attention Module (GGAM) which not only inherits properties from Gaussian distribution to instinctively revolve around the neighbor fields of each point but also is empowered to put the emphasis on contextually related regions during matching. Our fully-equipped model, namely Gaussian Attention Flow network (GAFlow), naturally incorporates a series of novel Gaussian-based modules into the conventional optical flow framework for reliable motion analysis. Extensive experiments on standard optical flow datasets consistently demonstrate the exceptional performance of the proposed approach in terms of both generalization ability evaluation and online benchmark testing. Code is available at https://github.com/LA30/GAFlow.
CVMar 19, 2023Code
Deep Learning for Camera Calibration and Beyond: A SurveyKang Liao, Lang Nie, Shujuan Huang et al.
Camera calibration involves estimating camera parameters to infer geometric features from captured sequences, which is crucial for computer vision and robotics. However, conventional calibration is laborious and requires dedicated collection. Recent efforts show that learning-based solutions have the potential to be used in place of the repeatability works of manual calibrations. Among these solutions, various learning strategies, networks, geometric priors, and datasets have been investigated. In this paper, we provide a comprehensive survey of learning-based camera calibration techniques, by analyzing their strengths and limitations. Our main calibration categories include the standard pinhole camera model, distortion camera model, cross-view model, and cross-sensor model, following the research trend and extended applications. As there is no unified benchmark in this community, we collect a holistic calibration dataset that can serve as a public platform to evaluate the generalization of existing methods. It comprises both synthetic and real-world data, with images and videos captured by different cameras in diverse scenes. Toward the end of this paper, we discuss the challenges and provide further research directions. To our knowledge, this is the first survey for the learning-based camera calibration (spanned 10 years). The summarized methods, datasets, and benchmarks are available and will be regularly updated at https://github.com/KangLiao929/Awesome-Deep-Camera-Calibration.
CVJul 7, 2022Code
Deep Rotation Correction without Angle PriorLang Nie, Chunyu Lin, Kang Liao et al.
Not everybody can be equipped with professional photography skills and sufficient shooting time, and there can be some tilts in the captured images occasionally. In this paper, we propose a new and practical task, named Rotation Correction, to automatically correct the tilt with high content fidelity in the condition that the rotated angle is unknown. This task can be easily integrated into image editing applications, allowing users to correct the rotated images without any manual operations. To this end, we leverage a neural network to predict the optical flows that can warp the tilted images to be perceptually horizontal. Nevertheless, the pixel-wise optical flow estimation from a single image is severely unstable, especially in large-angle tilted images. To enhance its robustness, we propose a simple but effective prediction strategy to form a robust elastic warp. Particularly, we first regress the mesh deformation that can be transformed into robust initial optical flows. Then we estimate residual optical flows to facilitate our network the flexibility of pixel-wise deformation, further correcting the details of the tilted images. To establish an evaluation benchmark and train the learning framework, a comprehensive rotation correction dataset is presented with a large diversity in scenes and rotated angles. Extensive experiments demonstrate that even in the absence of the angle prior, our algorithm can outperform other state-of-the-art solutions requiring this prior. The code and dataset are available at https://github.com/nie-lang/RotationCorrection.
CVOct 9, 2023Code
WeatherDepth: Curriculum Contrastive Learning for Self-Supervised Depth Estimation under Adverse Weather ConditionsJiyuan Wang, Chunyu Lin, Lang Nie et al.
Depth estimation models have shown promising performance on clear scenes but fail to generalize to adverse weather conditions due to illumination variations, weather particles, etc. In this paper, we propose WeatherDepth, a self-supervised robust depth estimation model with curriculum contrastive learning, to tackle performance degradation in complex weather conditions. Concretely, we first present a progressive curriculum learning scheme with three simple-to-complex curricula to gradually adapt the model from clear to relative adverse, and then to adverse weather scenes. It encourages the model to gradually grasp beneficial depth cues against the weather effect, yielding smoother and better domain adaption. Meanwhile, to prevent the model from forgetting previous curricula, we integrate contrastive learning into different curricula. By drawing reference knowledge from the previous course, our strategy establishes a depth consistency constraint between different courses toward robust depth estimation in diverse weather. Besides, to reduce manual intervention and better adapt to different models, we designed an adaptive curriculum scheduler to automatically search for the best timing for course switching. In the experiment, the proposed solution is proven to be easily incorporated into various architectures and demonstrates state-of-the-art (SoTA) performance on both synthetic and real weather datasets. Source code and data are available at \url{https://github.com/wangjiyuan9/WeatherDepth}.
CVMar 2, 2023Code
Disentangling Orthogonal Planes for Indoor Panoramic Room Layout Estimation with Cross-Scale Distortion AwarenessZhijie Shen, Zishuo Zheng, Chunyu Lin et al.
Based on the Manhattan World assumption, most existing indoor layout estimation schemes focus on recovering layouts from vertically compressed 1D sequences. However, the compression procedure confuses the semantics of different planes, yielding inferior performance with ambiguous interpretability. To address this issue, we propose to disentangle this 1D representation by pre-segmenting orthogonal (vertical and horizontal) planes from a complex scene, explicitly capturing the geometric cues for indoor layout estimation. Considering the symmetry between the floor boundary and ceiling boundary, we also design a soft-flipping fusion strategy to assist the pre-segmentation. Besides, we present a feature assembling mechanism to effectively integrate shallow and deep features with distortion distribution awareness. To compensate for the potential errors in pre-segmentation, we further leverage triple attention to reconstruct the disentangled sequences for better performance. Experiments on four popular benchmarks demonstrate our superiority over existing SoTA solutions, especially on the 3DIoU metric. The code is available at \url{https://github.com/zhijieshen-bjtu/DOPNet}.
CVJan 4, 2023Code
RecRecNet: Rectangling Rectified Wide-Angle Images by Thin-Plate Spline Model and DoF-based Curriculum LearningKang Liao, Lang Nie, Chunyu Lin et al.
The wide-angle lens shows appealing applications in VR technologies, but it introduces severe radial distortion into its captured image. To recover the realistic scene, previous works devote to rectifying the content of the wide-angle image. However, such a rectification solution inevitably distorts the image boundary, which changes related geometric distributions and misleads the current vision perception models. In this work, we explore constructing a win-win representation on both content and boundary by contributing a new learning model, i.e., Rectangling Rectification Network (RecRecNet). In particular, we propose a thin-plate spline (TPS) module to formulate the non-linear and non-rigid transformation for rectangling images. By learning the control points on the rectified image, our model can flexibly warp the source structure to the target domain and achieves an end-to-end unsupervised deformation. To relieve the complexity of structure approximation, we then inspire our RecRecNet to learn the gradual deformation rules with a DoF (Degree of Freedom)-based curriculum learning. By increasing the DoF in each curriculum stage, namely, from similarity transformation (4-DoF) to homography transformation (8-DoF), the network is capable of investigating more detailed deformations, offering fast convergence on the final rectangling task. Experiments show the superiority of our solution over the compared methods on both quantitative and qualitative evaluations. The code and dataset are available at https://github.com/KangLiao929/RecRecNet.
CVAug 3, 2022Code
Neural Contourlet Network for Monocular 360 Depth EstimationZhijie Shen, Chunyu Lin, Lang Nie et al.
For a monocular 360 image, depth estimation is a challenging because the distortion increases along the latitude. To perceive the distortion, existing methods devote to designing a deep and complex network architecture. In this paper, we provide a new perspective that constructs an interpretable and sparse representation for a 360 image. Considering the importance of the geometric structure in depth estimation, we utilize the contourlet transform to capture an explicit geometric cue in the spectral domain and integrate it with an implicit cue in the spatial domain. Specifically, we propose a neural contourlet network consisting of a convolutional neural network and a contourlet transform branch. In the encoder stage, we design a spatial-spectral fusion module to effectively fuse two types of cues. Contrary to the encoder, we employ the inverse contourlet transform with learned low-pass subbands and band-pass directional subbands to compose the depth in the decoder. Experiments on the three popular panoramic image datasets demonstrate that the proposed approach outperforms the state-of-the-art schemes with faster convergence. Code is available at https://github.com/zhijieshen-bjtu/Neural-Contourlet-Network-for-MODE.
CVMar 8, 2022
Deep Rectangling for Image Stitching: A Learning BaselineLang Nie, Chunyu Lin, Kang Liao et al.
Stitched images provide a wide field-of-view (FoV) but suffer from unpleasant irregular boundaries. To deal with this problem, existing image rectangling methods devote to searching an initial mesh and optimizing a target mesh to form the mesh deformation in two stages. Then rectangular images can be generated by warping stitched images. However, these solutions only work for images with rich linear structures, leading to noticeable distortions for portraits and landscapes with non-linear objects. In this paper, we address these issues by proposing the first deep learning solution to image rectangling. Concretely, we predefine a rigid target mesh and only estimate an initial mesh to form the mesh deformation, contributing to a compact one-stage solution. The initial mesh is predicted using a fully convolutional network with a residual progressive regression strategy. To obtain results with high content fidelity, a comprehensive objective function is proposed to simultaneously encourage the boundary rectangular, mesh shape-preserving, and content perceptually natural. Besides, we build the first image stitching rectangling dataset with a large diversity in irregular boundaries and scenes. Experiments demonstrate our superiority over traditional methods both quantitatively and qualitatively.
CVAug 29, 2024Code
Revisiting 360 Depth Estimation with PanoGabor: A New Fusion PerspectiveZhijie Shen, Chunyu Lin, Lang Nie et al.
Depth estimation from a monocular 360 image is important to the perception of the entire 3D environment. However, the inherent distortion and large field of view (FoV) in 360 images pose great challenges for this task. To this end, existing mainstream solutions typically introduce additional perspective-based 360 representations ({e.g., Cubemap) to achieve effective feature extraction. Nevertheless, regardless of the introduced representations, they eventually need to be unified into the equirectangular projection (ERP) format for the subsequent depth estimation, which inevitably reintroduces the troublesome distortions. In this work, we propose an oriented distortion-aware Gabor Fusion framework (PGFuse) to address the above challenges. First, we introduce Gabor filters that analyze texture in the frequency domain, thereby extending the receptive fields and enhancing depth cues. To address the reintroduced distortions, we design a linear latitude-aware distortion representation method to generate customized, distortion-aware Gabor filters (PanoGabor filters). Furthermore, we design a channel-wise and spatial-wise unidirectional fusion module (CS-UFM) that integrates the proposed PanoGabor filters to unify other representations into the ERP format, delivering effective and distortion-free features. Considering the orientation sensitivity of the Gabor transform, we introduce a spherical gradient constraint to stabilize this sensitivity. Experimental results on three popular indoor 360 benchmarks demonstrate the superiority of the proposed PGFuse to existing state-of-the-art solutions. Code and models will be available at https://github.com/zhijieshen-bjtu/PGFuse
CVMar 17, 2022
PanoFormer: Panorama Transformer for Indoor 360 Depth EstimationZhijie Shen, Chunyu Lin, Kang Liao et al.
Existing panoramic depth estimation methods based on convolutional neural networks (CNNs) focus on removing panoramic distortions, failing to perceive panoramic structures efficiently due to the fixed receptive field in CNNs. This paper proposes the panorama transformer (named PanoFormer) to estimate the depth in panorama images, with tangent patches from spherical domain, learnable token flows, and panorama specific metrics. In particular, we divide patches on the spherical tangent domain into tokens to reduce the negative effect of panoramic distortions. Since the geometric structures are essential for depth estimation, a self-attention module is redesigned with an additional learnable token flow. In addition, considering the characteristic of the spherical domain, we present two panorama-specific metrics to comprehensively evaluate the panoramic depth estimation models' performance. Extensive experiments demonstrate that our approach significantly outperforms the state-of-the-art (SOTA) methods. Furthermore, the proposed method can be effectively extended to solve semantic panorama segmentation, a similar pixel2pixel task. Code will be available.
CVJul 6, 2022
Complementary Bi-directional Feature Compression for Indoor 360° Semantic Segmentation with Self-distillationZishuo Zheng, Chunyu Lin, Lang Nie et al.
Recently, horizontal representation-based panoramic semantic segmentation approaches outperform projection-based solutions, because the distortions can be effectively removed by compressing the spherical data in the vertical direction. However, these methods ignore the distortion distribution prior and are limited to unbalanced receptive fields, e.g., the receptive fields are sufficient in the vertical direction and insufficient in the horizontal direction. Differently, a vertical representation compressed in another direction can offer implicit distortion prior and enlarge horizontal receptive fields. In this paper, we combine the two different representations and propose a novel 360° semantic segmentation solution from a complementary perspective. Our network comprises three modules: a feature extraction module, a bi-directional compression module, and an ensemble decoding module. First, we extract multi-scale features from a panorama. Then, a bi-directional compression module is designed to compress features into two complementary low-dimensional representations, which provide content perception and distortion prior. Furthermore, to facilitate the fusion of bi-directional features, we design a unique self distillation strategy in the ensemble decoding module to enhance the interaction of different features and further improve the performance. Experimental results show that our approach outperforms the state-of-the-art solutions with at least 10\% improvement on quantitative evaluations while displaying the best performance on visual appearance.
CVMar 18, 2022
Distortion-Tolerant Monocular Depth Estimation On Omnidirectional Images Using Dual-cubemapZhijie Shen, Chunyu Lin, Lang Nie et al.
Estimating the depth of omnidirectional images is more challenging than that of normal field-of-view (NFoV) images because the varying distortion can significantly twist an object's shape. The existing methods suffer from troublesome distortion while estimating the depth of omnidirectional images, leading to inferior performance. To reduce the negative impact of the distortion influence, we propose a distortion-tolerant omnidirectional depth estimation algorithm using a dual-cubemap. It comprises two modules: Dual-Cubemap Depth Estimation (DCDE) module and Boundary Revision (BR) module. In DCDE module, we present a rotation-based dual-cubemap model to estimate the accurate NFoV depth, reducing the distortion at the cost of boundary discontinuity on omnidirectional depths. Then a boundary revision module is designed to smooth the discontinuous boundaries, which contributes to the precise and visually continuous omnidirectional depths. Extensive experiments demonstrate the superiority of our method over other state-of-the-art solutions.
CVFeb 20, 2023
Unsupervised OmniMVS: Efficient Omnidirectional Depth Inference via Establishing Pseudo-Stereo SupervisionZisong Chen, Chunyu Lin, Lang Nie et al.
Omnidirectional multi-view stereo (MVS) vision is attractive for its ultra-wide field-of-view (FoV), enabling machines to perceive 360° 3D surroundings. However, the existing solutions require expensive dense depth labels for supervision, making them impractical in real-world applications. In this paper, we propose the first unsupervised omnidirectional MVS framework based on multiple fisheye images. To this end, we project all images to a virtual view center and composite two panoramic images with spherical geometry from two pairs of back-to-back fisheye images. The two 360° images formulate a stereo pair with a special pose, and the photometric consistency is leveraged to establish the unsupervised constraint, which we term "Pseudo-Stereo Supervision". In addition, we propose Un-OmniMVS, an efficient unsupervised omnidirectional MVS network, to facilitate the inference speed with two efficient components. First, a novel feature extractor with frequency attention is proposed to simultaneously capture the non-local Fourier features and local spatial features, explicitly facilitating the feature representation. Then, a variance-based light cost volume is put forward to reduce the computational complexity. Experiments exhibit that the performance of our unsupervised solution is competitive to that of the state-of-the-art (SoTA) supervised methods with better generalization in real-world data.
CVMar 3
Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene EditingJiyuan Wang, Chunyu Lin, Lei Sun et al.
Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturally positioning reinforcement learning (RL) as a feasible solution. Motivated by this, we propose \textbf{RL3DEdit}, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model, VGGT. Specifically, we leverage VGGT's robust priors learned from massive real-world data, feed the edited images, and utilize the output confidence maps and pose estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.
CVMar 11, 2024Code
Eliminating Warping Shakes for Unsupervised Online Video StitchingLang Nie, Chunyu Lin, Kang Liao et al.
In this paper, we retarget video stitching to an emerging issue, named warping shake, when extending image stitching to video stitching. It unveils the temporal instability of warped content in non-overlapping regions, despite image stitching having endeavored to preserve the natural structures. Therefore, in most cases, even if the input videos to be stitched are stable, the stitched video will inevitably cause undesired warping shakes and affect the visual experience. To eliminate the shakes, we propose StabStitch to simultaneously realize video stitching and video stabilization in a unified unsupervised learning framework. Starting from the camera paths in video stabilization, we first derive the expression of stitching trajectories in video stitching by elaborately integrating spatial and temporal warps. Then a warp smoothing model is presented to optimize them with a comprehensive consideration regarding content alignment, trajectory smoothness, spatial consistency, and online collaboration. To establish an evaluation benchmark and train the learning framework, we build a video stitching dataset with a rich diversity in camera motions and scenes. Compared with existing stitching solutions, StabStitch exhibits significant superiority in scene robustness and inference speed in addition to stitching and stabilization performance, contributing to a robust and real-time online video stitching system. The code and dataset are available at https://github.com/nie-lang/StabStitch.
CVApr 15, 2024Code
Digging into contrastive learning for robust depth estimation with diffusion modelsJiyuan Wang, Chunyu Lin, Lang Nie et al.
Recently, diffusion-based depth estimation methods have drawn widespread attention due to their elegant denoising patterns and promising performance. However, they are typically unreliable under adverse conditions prevalent in real-world scenarios, such as rainy, snowy, etc. In this paper, we propose a novel robust depth estimation method called D4RD, featuring a custom contrastive learning mode tailored for diffusion models to mitigate performance degradation in complex environments. Concretely, we integrate the strength of knowledge distillation into contrastive learning, building the `trinity' contrastive scheme. This scheme utilizes the sampled noise of the forward diffusion process as a natural reference, guiding the predicted noise in diverse scenes toward a more stable and precise optimum. Moreover, we extend noise-level trinity to encompass more generic feature and image levels, establishing a multi-level contrast to distribute the burden of robust perception across the overall network. Before addressing complex scenarios, we enhance the stability of the baseline diffusion model with three straightforward yet effective improvements, which facilitate convergence and remove depth outliers. Extensive experiments demonstrate that D4RD surpasses existing state-of-the-art solutions on synthetic corruption datasets and real-world weather conditions. Source code and data are available at \url{https://github.com/wangjiyuan9/D4RD}.
CVDec 26, 2023Code
360 Layout Estimation via Orthogonal Planes Disentanglement and Multi-view Geometric Consistency PerceptionZhijie Shen, Chunyu Lin, Junsong Zhang et al.
Existing panoramic layout estimation solutions tend to recover room boundaries from a vertically compressed sequence, yielding imprecise results as the compression process often muddles the semantics between various planes. Besides, these data-driven approaches impose an urgent demand for massive data annotations, which are laborious and time-consuming. For the first problem, we propose an orthogonal plane disentanglement network (termed DOPNet) to distinguish ambiguous semantics. DOPNet consists of three modules that are integrated to deliver distortion-free, semantics-clean, and detail-sharp disentangled representations, which benefit the subsequent layout recovery. For the second problem, we present an unsupervised adaptation technique tailored for horizon-depth and ratio representations. Concretely, we introduce an optimization strategy for decision-level layout analysis and a 1D cost volume construction method for feature-level multi-view aggregation, both of which are designed to fully exploit the geometric consistency across multiple perspectives. The optimizer provides a reliable set of pseudo-labels for network training, while the 1D cost volume enriches each view with comprehensive scene information derived from other perspectives. Extensive experiments demonstrate that our solution outperforms other SoTA models on both monocular layout estimation and multi-view layout estimation tasks. Cobe can be available at https://github.com/zhijieshen-bjtu/MV-DOPNet.
CVSep 16, 2025Code
Advancing Real-World Parking Slot Detection with Large-Scale Dataset and Semi-Supervised BaselineZhihao Zhang, Chunyu Lin, Lang Nie et al.
As automatic parking systems evolve, the accurate detection of parking slots has become increasingly critical. This study focuses on parking slot detection using surround-view cameras, which offer a comprehensive bird's-eye view of the parking environment. However, the current datasets are limited in scale, and the scenes they contain are seldom disrupted by real-world noise (e.g., light, occlusion, etc.). Moreover, manual data annotation is prone to errors and omissions due to the complexity of real-world conditions, significantly increasing the cost of annotating large-scale datasets. To address these issues, we first construct a large-scale parking slot detection dataset (named CRPS-D), which includes various lighting distributions, diverse weather conditions, and challenging parking slot variants. Compared with existing datasets, the proposed dataset boasts the largest data scale and consists of a higher density of parking slots, particularly featuring more slanted parking slots. Additionally, we develop a semi-supervised baseline for parking slot detection, termed SS-PSD, to further improve performance by exploiting unlabeled data. To our knowledge, this is the first semi-supervised approach in parking slot detection, which is built on the teacher-student model with confidence-guided mask consistency and adaptive feature perturbation. Experimental results demonstrate the superiority of SS-PSD over the existing state-of-the-art (SoTA) solutions on both the proposed dataset and the existing dataset. Particularly, the more unlabeled data there is, the more significant the gains brought by our semi-supervised scheme. The relevant source codes and the dataset have been made publicly available at https://github.com/zzh362/CRPS-D.
CVAug 7, 2025Code
Robust Image Stitching with Optimal PlaneLang Nie, Yuan Mei, Kang Liao et al.
We present \textit{RopStitch}, an unsupervised deep image stitching framework with both robustness and naturalness. To ensure the robustness of \textit{RopStitch}, we propose to incorporate the universal prior of content perception into the image stitching model by a dual-branch architecture. It separately captures coarse and fine features and integrates them to achieve highly generalizable performance across diverse unseen real-world scenes. Concretely, the dual-branch model consists of a pretrained branch to capture semantically invariant representations and a learnable branch to extract fine-grained discriminative features, which are then merged into a whole by a controllable factor at the correlation level. Besides, considering that content alignment and structural preservation are often contradictory to each other, we propose a concept of virtual optimal planes to relieve this conflict. To this end, we model this problem as a process of estimating homography decomposition coefficients, and design an iterative coefficient predictor and minimal semantic distortion constraint to identify the optimal plane. This scheme is finally incorporated into \textit{RopStitch} by warping both views onto the optimal plane bidirectionally. Extensive experiments across various datasets demonstrate that \textit{RopStitch} significantly outperforms existing methods, particularly in scene robustness and content naturalness. The code is available at {\color{red}https://github.com/MmelodYy/RopStitch}.
CVApr 12, 2025Code
You Need a Transition Plane: Bridging Continuous Panoramic 3D Reconstruction with Perspective Gaussian SplattingZhijie Shen, Chunyu Lin, Shujuan Huang et al.
Recently, reconstructing scenes from a single panoramic image using advanced 3D Gaussian Splatting (3DGS) techniques has attracted growing interest. Panoramic images offer a 360$\times$ 180 field of view (FoV), capturing the entire scene in a single shot. However, panoramic images introduce severe distortion, making it challenging to render 3D Gaussians into 2D distorted equirectangular space directly. Converting equirectangular images to cubemap projections partially alleviates this problem but introduces new challenges, such as projection distortion and discontinuities across cube-face boundaries. To address these limitations, we present a novel framework, named TPGS, to bridge continuous panoramic 3D scene reconstruction with perspective Gaussian splatting. Firstly, we introduce a Transition Plane between adjacent cube faces to enable smoother transitions in splatting directions and mitigate optimization ambiguity in the boundary region. Moreover, an intra-to-inter face optimization strategy is proposed to enhance local details and restore visual consistency across cube-face boundaries. Specifically, we optimize 3D Gaussians within individual cube faces and then fine-tune them in the stitched panoramic space. Additionally, we introduce a spherical sampling technique to eliminate visible stitching seams. Extensive experiments on indoor and outdoor, egocentric, and roaming benchmark datasets demonstrate that our approach outperforms existing state-of-the-art methods. Code and models will be available at https://github.com/zhijieshen-bjtu/TPGS.
CVJan 24, 2024Code
Semi-Supervised Coupled Thin-Plate Spline Model for Rotation Correction and BeyondLang Nie, Chunyu Lin, Kang Liao et al.
Thin-plate spline (TPS) is a principal warp that allows for representing elastic, nonlinear transformation with control point motions. With the increase of control points, the warp becomes increasingly flexible but usually encounters a bottleneck caused by undesired issues, e.g., content distortion. In this paper, we explore generic applications of TPS in single-image-based warping tasks, such as rotation correction, rectangling, and portrait correction. To break this bottleneck, we propose the coupled thin-plate spline model (CoupledTPS), which iteratively couples multiple TPS with limited control points into a more flexible and powerful transformation. Concretely, we first design an iterative search to predict new control points according to the current latent condition. Then, we present the warping flow as a bridge for the coupling of different TPS transformations, effectively eliminating interpolation errors caused by multiple warps. Besides, in light of the laborious annotation cost, we develop a semi-supervised learning scheme to improve warping quality by exploiting unlabeled data. It is formulated through dual transformation between the searched control points of unlabeled data and its graphic augmentation, yielding an implicit correction consistency constraint. Finally, we collect massive unlabeled data to exhibit the benefit of our semi-supervised scheme in rotation correction. Extensive experiments demonstrate the superiority and universality of CoupledTPS over the existing state-of-the-art (SoTA) solutions for rotation correction and beyond. The code and data are available at https://github.com/nie-lang/CoupledTPS.
CVJul 6, 2021Code
Depth-Aware Multi-Grid Deep Homography Estimation with Contextual CorrelationLang Nie, Chunyu Lin, Kang Liao et al.
Homography estimation is an important task in computer vision applications, such as image stitching, video stabilization, and camera calibration. Traditional homography estimation methods heavily depend on the quantity and distribution of feature correspondences, leading to poor robustness in low-texture scenes. The learning solutions, on the contrary, try to learn robust deep features but demonstrate unsatisfying performance in the scenes with low overlap rates. In this paper, we address these two problems simultaneously by designing a contextual correlation layer (CCL). The CCL can efficiently capture the long-range correlation within feature maps and can be flexibly used in a learning framework. In addition, considering that a single homography can not represent the complex spatial transformation in depth-varying images with parallax, we propose to predict multi-grid homography from global to local. Moreover, we equip our network with a depth perception capability, by introducing a novel depth-aware shape-preserved loss. Extensive experiments demonstrate the superiority of our method over state-of-the-art solutions in the synthetic benchmark dataset and real-world dataset. The codes and models will be available at https://github.com/nie-lang/Multi-Grid-Deep-Homography.
63.8CVMar 11
UniStitch: Unifying Semantic and Geometric Features for Image StitchingYuan Mei, Lang Nie, Kang Liao et al.
Traditional image stitching methods estimate warps from hand-crafted geometric features, whereas recent learning-based solutions leverage semantic features from neural networks instead. These two lines of research have largely diverged along separate evolution, with virtually no meaningful convergence to date. In this paper, we take a pioneering step to bridge this gap by unifying semantic and geometric features with UniStitch, a unified image stitching framework from multimodal features. To align discrete geometric features (i.e., keypoint) with continuous semantic feature maps, we present a Neural Point Transformer (NPT) module, which transforms unordered, sparse 1D geometric keypoints into ordered, dense 2D semantic maps. Then, to integrate the advantages of both representations, an Adaptive Mixture of Experts (AMoE) module is designed to fuse geometric and semantic representations. It dynamically shifts focus toward more reliable features during the fusion process, allowing the model to handle complex scenes, especially when either modality might be compromised. The fused representation can be adopted into common deep stitching pipelines, delivering significant performance gains over any single feature. Experiments show that UniStitch outperforms existing state-of-the-art methods with a large margin, paving the way for a unified paradigm between traditional and learning-based image stitching.
CVMar 20, 2025
Jasmine: Harnessing Diffusion Prior for Self-supervised Depth EstimationJiyuan Wang, Chunyu Lin, Cheng Guan et al.
In this paper, we propose Jasmine, the first Stable Diffusion (SD)-based self-supervised framework for monocular depth estimation, which effectively harnesses SD's visual priors to enhance the sharpness and generalization of unsupervised prediction. Previous SD-based methods are all supervised since adapting diffusion models for dense prediction requires high-precision supervision. In contrast, self-supervised reprojection suffers from inherent challenges (e.g., occlusions, texture-less regions, illumination variance), and the predictions exhibit blurs and artifacts that severely compromise SD's latent priors. To resolve this, we construct a novel surrogate task of hybrid image reconstruction. Without any additional supervision, it preserves the detail priors of SD models by reconstructing the images themselves while preventing depth estimation from degradation. Furthermore, to address the inherent misalignment between SD's scale and shift invariant estimation and self-supervised scale-invariant depth estimation, we build the Scale-Shift GRU. It not only bridges this distribution gap but also isolates the fine-grained texture of SD output against the interference of reprojection loss. Extensive experiments demonstrate that Jasmine achieves SoTA performance on the KITTI benchmark and exhibits superior zero-shot generalization across multiple datasets.
CVMay 8, 2025
StabStitch++: Unsupervised Online Video Stitching with Spatiotemporal Bidirectional WarpsLang Nie, Chunyu Lin, Kang Liao et al.
We retarget video stitching to an emerging issue, named warping shake, which unveils the temporal content shakes induced by sequentially unsmooth warps when extending image stitching to video stitching. Even if the input videos are stable, the stitched video can inevitably cause undesired warping shakes and affect the visual experience. To address this issue, we propose StabStitch++, a novel video stitching framework to realize spatial stitching and temporal stabilization with unsupervised learning simultaneously. First, different from existing learning-based image stitching solutions that typically warp one image to align with another, we suppose a virtual midplane between original image planes and project them onto it. Concretely, we design a differentiable bidirectional decomposition module to disentangle the homography transformation and incorporate it into our spatial warp, evenly spreading alignment burdens and projective distortions across two views. Then, inspired by camera paths in video stabilization, we derive the mathematical expression of stitching trajectories in video stitching by elaborately integrating spatial and temporal warps. Finally, a warp smoothing model is presented to produce stable stitched videos with a hybrid loss to simultaneously encourage content alignment, trajectory smoothness, and online collaboration. Compared with StabStitch that sacrifices alignment for stabilization, StabStitch++ makes no compromise and optimizes both of them simultaneously, especially in the online mode. To establish an evaluation benchmark and train the learning framework, we build a video stitching dataset with a rich diversity in camera motions and scenes. Experiments exhibit that StabStitch++ surpasses current solutions in stitching performance, robustness, and efficiency, offering compelling advancements in this field by building a real-time online video stitching system.
CVApr 23, 2024
SGFormer: Spherical Geometry Transformer for 360 Depth EstimationJunsong Zhang, Zisong Chen, Chunyu Lin et al.
Panoramic distortion poses a significant challenge in 360 depth estimation, particularly pronounced at the north and south poles. Existing methods either adopt a bi-projection fusion strategy to remove distortions or model long-range dependencies to capture global structures, which can result in either unclear structure or insufficient local perception. In this paper, we propose a spherical geometry transformer, named SGFormer, to address the above issues, with an innovative step to integrate spherical geometric priors into vision transformers. To this end, we retarget the transformer decoder to a spherical prior decoder (termed SPDecoder), which endeavors to uphold the integrity of spherical structures during decoding. Concretely, we leverage bipolar re-projection, circular rotation, and curve local embedding to preserve the spherical characteristics of equidistortion, continuity, and surface distance, respectively. Furthermore, we present a query-based global conditional position embedding to compensate for spatial structure at varying resolutions. It not only boosts the global perception of spatial position but also sharpens the depth structure across different patches. Finally, we conduct extensive experiments on popular benchmarks, demonstrating our superiority over state-of-the-art solutions.
CVSep 4, 2025
From Editor to Dense Geometry EstimatorJiYuan Wang, Chunyu Lin, Lei Sun et al.
Leveraging visual priors from pre-trained text-to-image (T2I) generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning. Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by ``refining" their innate features, and ultimately achieve higher performance than their generative counterparts. Based on these findings, we introduce \textbf{FE2E}, a framework that pioneeringly adapts an advanced editing model based on Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor's original flow matching loss into the ``consistent velocity" training objective. And we use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high precision demand of our tasks. Additionally, we leverage the DiT's global attention for a cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to mutually enhance each other. Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35\% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100$\times$ data. The project page can be accessed \href{https://amap-ml.github.io/FE2E/}{here}.
CVMay 10, 2025
StableMotion: Repurposing Diffusion-Based Image Priors for Motion EstimationZiyi Wang, Haipeng Li, Lin Sui et al.
We present StableMotion, a novel framework leverages knowledge (geometry and content priors) from pretrained large-scale image diffusion models to perform motion estimation, solving single-image-based image rectification tasks such as Stitched Image Rectangling (SIR) and Rolling Shutter Correction (RSC). Specifically, StableMotion framework takes text-to-image Stable Diffusion (SD) models as backbone and repurposes it into an image-to-motion estimator. To mitigate inconsistent output produced by diffusion models, we propose Adaptive Ensemble Strategy (AES) that consolidates multiple outputs into a cohesive, high-fidelity result. Additionally, we present the concept of Sampling Steps Disaster (SSD), the counterintuitive scenario where increasing the number of sampling steps can lead to poorer outcomes, which enables our framework to achieve one-step inference. StableMotion is verified on two image rectification tasks and delivers state-of-the-art performance in both, as well as showing strong generalizability. Supported by SSD, StableMotion offers a speedup of 200 times compared to previous diffusion model-based methods.
CVApr 1, 2025
Beyond Wide-Angle Images: Structure-to-Detail Video Portrait Correction via Unsupervised Spatiotemporal AdaptationWenbo Nie, Lang Nie, Chunyu Lin et al.
Wide-angle cameras, despite their popularity for content creation, suffer from distortion-induced facial stretching-especially at the edge of the lens-which degrades visual appeal. To address this issue, we propose a structure-to-detail portrait correction model named ImagePC. It integrates the long-range awareness of the transformer and multi-step denoising of diffusion models into a unified framework, achieving global structural robustness and local detail refinement. Besides, considering the high cost of obtaining video labels, we then repurpose ImagePC for unlabeled wide-angle videos (termed VideoPC), by spatiotemporal diffusion adaption with spatial consistency and temporal smoothness constraints. For the former, we encourage the denoised image to approximate pseudo labels following the wide-angle distortion distribution pattern, while for the latter, we derive rectification trajectories with backward optical flows and smooth them. Compared with ImagePC, VideoPC maintains high-quality facial corrections in space and mitigates the potential temporal shakes sequentially in blind scenarios. Finally, to establish an evaluation benchmark and train the framework, we establish a video portrait dataset with a large diversity in the number of people, lighting conditions, and background. Experiments demonstrate that the proposed methods outperform existing solutions quantitatively and qualitatively, contributing to high-fidelity wide-angle videos with stable and natural portraits. The codes and dataset will be available.
CVMar 3, 2025
Semi-Supervised 360 Layout Estimation with Panoramic Collaborative PerturbationsJunsong Zhang, Chunyu Lin, Zhijie Shen et al.
The performance of existing supervised layout estimation methods heavily relies on the quality of data annotations. However, obtaining large-scale and high-quality datasets remains a laborious and time-consuming challenge. To solve this problem, semi-supervised approaches are introduced to relieve the demand for expensive data annotations by encouraging the consistent results of unlabeled data with different perturbations. However, existing solutions merely employ vanilla perturbations, ignoring the characteristics of panoramic layout estimation. In contrast, we propose a novel semi-supervised method named SemiLayout360, which incorporates the priors of the panoramic layout and distortion through collaborative perturbations. Specifically, we leverage the panoramic layout prior to enhance the model's focus on potential layout boundaries. Meanwhile, we introduce the panoramic distortion prior to strengthen distortion awareness. Furthermore, to prevent intense perturbations from hindering model convergence and ensure the effectiveness of prior-based perturbations, we divide and reorganize them as panoramic collaborative perturbations. Our experimental results on three mainstream benchmarks demonstrate that the proposed method offers significant advantages over existing state-of-the-art (SoTA) solutions.
CVDec 8, 2021
FPPN: Future Pseudo-LiDAR Frame Prediction for Autonomous DrivingXudong Huang, Chunyu Lin, Haojie Liu et al.
LiDAR sensors are widely used in autonomous driving due to the reliable 3D spatial information. However, the data of LiDAR is sparse and the frequency of LiDAR is lower than that of cameras. To generate denser point clouds spatially and temporally, we propose the first future pseudo-LiDAR frame prediction network. Given the consecutive sparse depth maps and RGB images, we first predict a future dense depth map based on dynamic motion information coarsely. To eliminate the errors of optical flow estimation, an inter-frame aggregation module is proposed to fuse the warped depth maps with adaptive weights. Then, we refine the predicted dense depth map using static contextual information. The future pseudo-LiDAR frame can be obtained by converting the predicted dense depth map into corresponding 3D point clouds. Experimental results show that our method outperforms the existing solutions on the popular KITTI benchmark.
CVJun 24, 2021
Unsupervised Deep Image Stitching: Reconstructing Stitched Features to ImagesLang Nie, Chunyu Lin, Kang Liao et al.
Traditional feature-based image stitching technologies rely heavily on feature detection quality, often failing to stitch images with few features or low resolution. The learning-based image stitching solutions are rarely studied due to the lack of labeled data, making the supervised methods unreliable. To address the above limitations, we propose an unsupervised deep image stitching framework consisting of two stages: unsupervised coarse image alignment and unsupervised image reconstruction. In the first stage, we design an ablation-based loss to constrain an unsupervised homography network, which is more suitable for large-baseline scenes. Moreover, a transformer layer is introduced to warp the input images in the stitching-domain space. In the second stage, motivated by the insight that the misalignments in pixel-level can be eliminated to a certain extent in feature-level, we design an unsupervised image reconstruction network to eliminate the artifacts from features to pixels. Specifically, the reconstruction network can be implemented by a low-resolution deformation branch and a high-resolution refined branch, learning the deformation rules of image stitching and enhancing the resolution simultaneously. To establish an evaluation benchmark and train the learning framework, a comprehensive real-world image dataset for unsupervised deep image stitching is presented and released. Extensive experiments well demonstrate the superiority of our method over other state-of-the-art solutions. Even compared with the supervised solutions, our image stitching quality is still preferred by users.
CVDec 11, 2020
Learning Edge-Preserved Image Stitching from Large-Baseline Deep HomographyLang Nie, Chunyu Lin, Kang Liao et al.
Image stitching is a classical and crucial technique in computer vision, which aims to generate the image with a wide field of view. The traditional methods heavily depend on the feature detection and require that scene features be dense and evenly distributed in the image, leading to varying ghosting effects and poor robustness. Learning methods usually suffer from fixed view and input size limitations, showing a lack of generalization ability on other real datasets. In this paper, we propose an image stitching learning framework, which consists of a large-baseline deep homography module and an edge-preserved deformation module. First, we propose a large-baseline deep homography module to estimate the accurate projective transformation between the reference image and the target image in different scales of features. After that, an edge-preserved deformation module is designed to learn the deformation rules of image stitching from edge to content, eliminating the ghosting effects as much as possible. In particular, the proposed learning framework can stitch images of arbitrary views and input sizes, thus contribute to a supervised deep image stitching method with excellent generalization capability in other real images. Experimental results demonstrate that our homography module significantly outperforms the existing deep homography methods in the large baseline scenes. In image stitching, our method is superior to the existing learning method and shows competitive performance with state-of-the-art traditional methods.