Chunyu Lin

CV
h-index18
55papers
1,510citations
Novelty53%
AI Score61

55 Papers

CVFeb 16, 2023Code
Parallax-Tolerant Unsupervised Deep Image Stitching

Lang Nie, Chunyu Lin, Kang Liao et al.

Traditional image stitching approaches tend to leverage increasingly complex geometric features (point, line, edge, etc.) for better performance. However, these hand-crafted features are only suitable for specific natural scenes with adequate geometric structures. In contrast, deep stitching schemes overcome the adverse conditions by adaptively learning robust semantic features, but they cannot handle large-parallax cases due to homography-based registration. To solve these issues, we propose UDIS++, a parallax-tolerant unsupervised deep image stitching technique. First, we propose a robust and flexible warp to model the image registration from global homography to local thin-plate spline motion. It provides accurate alignment for overlapping regions and shape preservation for non-overlapping regions by joint optimization concerning alignment and distortion. Subsequently, to improve the generalization capability, we design a simple but effective iterative strategy to enhance the warp adaption in cross-dataset and cross-resolution applications. Finally, to further eliminate the parallax artifacts, we propose to composite the stitched image seamlessly by unsupervised learning for seam-driven composition masks. Compared with existing methods, our solution is parallax-tolerant and free from laborious designs of complicated geometric features for specific scenes. Extensive experiments show our superiority over the SoTA methods, both quantitatively and qualitatively. The code is available at https://github.com/nie-lang/UDIS2.

CVSep 28, 2023Code
GAFlow: Incorporating Gaussian Attention into Optical Flow

Ao Luo, Fan Yang, Xin Li et al.

Optical flow, or the estimation of motion fields from image sequences, is one of the fundamental problems in computer vision. Unlike most pixel-wise tasks that aim at achieving consistent representations of the same category, optical flow raises extra demands for obtaining local discrimination and smoothness, which yet is not fully explored by existing approaches. In this paper, we push Gaussian Attention (GA) into the optical flow models to accentuate local properties during representation learning and enforce the motion affinity during matching. Specifically, we introduce a novel Gaussian-Constrained Layer (GCL) which can be easily plugged into existing Transformer blocks to highlight the local neighborhood that contains fine-grained structural information. Moreover, for reliable motion analysis, we provide a new Gaussian-Guided Attention Module (GGAM) which not only inherits properties from Gaussian distribution to instinctively revolve around the neighbor fields of each point but also is empowered to put the emphasis on contextually related regions during matching. Our fully-equipped model, namely Gaussian Attention Flow network (GAFlow), naturally incorporates a series of novel Gaussian-based modules into the conventional optical flow framework for reliable motion analysis. Extensive experiments on standard optical flow datasets consistently demonstrate the exceptional performance of the proposed approach in terms of both generalization ability evaluation and online benchmark testing. Code is available at https://github.com/LA30/GAFlow.

CVMay 8, 2022Code
Unsupervised Homography Estimation with Coplanarity-Aware GAN

Mingbo Hong, Yuhang Lu, Nianjin Ye et al.

Estimating homography from an image pair is a fundamental problem in image alignment. Unsupervised learning methods have received increasing attention in this field due to their promising performance and label-free training. However, existing methods do not explicitly consider the problem of plane-induced parallax, which will make the predicted homography compromised on multiple planes. In this work, we propose a novel method HomoGAN to guide unsupervised homography estimation to focus on the dominant plane. First, a multi-scale transformer network is designed to predict homography from the feature pyramids of input images in a coarse-to-fine fashion. Moreover, we propose an unsupervised GAN to impose coplanarity constraint on the predicted homography, which is realized by using a generator to predict a mask of aligned regions, and then a discriminator to check if two masked feature maps are induced by a single homography. To validate the effectiveness of HomoGAN and its components, we conduct extensive experiments on a large-scale dataset, and the results show that our matching error is 22% lower than the previous SOTA method. Code is available at https://github.com/megvii-research/HomoGAN.

CVApr 18, 2022Code
Cylin-Painting: Seamless {360\textdegree} Panoramic Image Outpainting and Beyond

Kang Liao, Xiangyu Xu, Chunyu Lin et al.

Image outpainting gains increasing attention since it can generate the complete scene from a partial view, providing a valuable solution to construct {360\textdegree} panoramic images. As image outpainting suffers from the intrinsic issue of unidirectional completion flow, previous methods convert the original problem into inpainting, which allows a bidirectional flow. However, we find that inpainting has its own limitations and is inferior to outpainting in certain situations. The question of how they may be combined for the best of both has as yet remained under-explored. In this paper, we provide a deep analysis of the differences between inpainting and outpainting, which essentially depends on how the source pixels contribute to the unknown regions under different spatial arrangements. Motivated by this analysis, we present a Cylin-Painting framework that involves meaningful collaborations between inpainting and outpainting and efficiently fuses the different arrangements, with a view to leveraging their complementary benefits on a seamless cylinder. Nevertheless, straightforwardly applying the cylinder-style convolution often generates visually unpleasing results as it discards important positional information. To address this issue, we further present a learnable positional embedding strategy to incorporate the missing component of positional encoding into the cylinder convolution, which significantly improves the panoramic results. It is noted that while developed for image outpainting, the proposed algorithm can be effectively extended to other panoramic vision tasks, such as object detection, depth estimation, and image super-resolution. Code will be made available at \url{https://github.com/KangLiao929/Cylin-Painting}.

CVMar 19, 2023Code
Deep Learning for Camera Calibration and Beyond: A Survey

Kang Liao, Lang Nie, Shujuan Huang et al.

Camera calibration involves estimating camera parameters to infer geometric features from captured sequences, which is crucial for computer vision and robotics. However, conventional calibration is laborious and requires dedicated collection. Recent efforts show that learning-based solutions have the potential to be used in place of the repeatability works of manual calibrations. Among these solutions, various learning strategies, networks, geometric priors, and datasets have been investigated. In this paper, we provide a comprehensive survey of learning-based camera calibration techniques, by analyzing their strengths and limitations. Our main calibration categories include the standard pinhole camera model, distortion camera model, cross-view model, and cross-sensor model, following the research trend and extended applications. As there is no unified benchmark in this community, we collect a holistic calibration dataset that can serve as a public platform to evaluate the generalization of existing methods. It comprises both synthetic and real-world data, with images and videos captured by different cameras in diverse scenes. Toward the end of this paper, we discuss the challenges and provide further research directions. To our knowledge, this is the first survey for the learning-based camera calibration (spanned 10 years). The summarized methods, datasets, and benchmarks are available and will be regularly updated at https://github.com/KangLiao929/Awesome-Deep-Camera-Calibration.

CVJul 7, 2022Code
Deep Rotation Correction without Angle Prior

Lang Nie, Chunyu Lin, Kang Liao et al.

Not everybody can be equipped with professional photography skills and sufficient shooting time, and there can be some tilts in the captured images occasionally. In this paper, we propose a new and practical task, named Rotation Correction, to automatically correct the tilt with high content fidelity in the condition that the rotated angle is unknown. This task can be easily integrated into image editing applications, allowing users to correct the rotated images without any manual operations. To this end, we leverage a neural network to predict the optical flows that can warp the tilted images to be perceptually horizontal. Nevertheless, the pixel-wise optical flow estimation from a single image is severely unstable, especially in large-angle tilted images. To enhance its robustness, we propose a simple but effective prediction strategy to form a robust elastic warp. Particularly, we first regress the mesh deformation that can be transformed into robust initial optical flows. Then we estimate residual optical flows to facilitate our network the flexibility of pixel-wise deformation, further correcting the details of the tilted images. To establish an evaluation benchmark and train the learning framework, a comprehensive rotation correction dataset is presented with a large diversity in scenes and rotated angles. Extensive experiments demonstrate that even in the absence of the angle prior, our algorithm can outperform other state-of-the-art solutions requiring this prior. The code and dataset are available at https://github.com/nie-lang/RotationCorrection.

CVOct 9, 2023Code
WeatherDepth: Curriculum Contrastive Learning for Self-Supervised Depth Estimation under Adverse Weather Conditions

Jiyuan Wang, Chunyu Lin, Lang Nie et al.

Depth estimation models have shown promising performance on clear scenes but fail to generalize to adverse weather conditions due to illumination variations, weather particles, etc. In this paper, we propose WeatherDepth, a self-supervised robust depth estimation model with curriculum contrastive learning, to tackle performance degradation in complex weather conditions. Concretely, we first present a progressive curriculum learning scheme with three simple-to-complex curricula to gradually adapt the model from clear to relative adverse, and then to adverse weather scenes. It encourages the model to gradually grasp beneficial depth cues against the weather effect, yielding smoother and better domain adaption. Meanwhile, to prevent the model from forgetting previous curricula, we integrate contrastive learning into different curricula. By drawing reference knowledge from the previous course, our strategy establishes a depth consistency constraint between different courses toward robust depth estimation in diverse weather. Besides, to reduce manual intervention and better adapt to different models, we designed an adaptive curriculum scheduler to automatically search for the best timing for course switching. In the experiment, the proposed solution is proven to be easily incorporated into various architectures and demonstrates state-of-the-art (SoTA) performance on both synthetic and real weather datasets. Source code and data are available at \url{https://github.com/wangjiyuan9/WeatherDepth}.

CVMar 2, 2023Code
Disentangling Orthogonal Planes for Indoor Panoramic Room Layout Estimation with Cross-Scale Distortion Awareness

Zhijie Shen, Zishuo Zheng, Chunyu Lin et al.

Based on the Manhattan World assumption, most existing indoor layout estimation schemes focus on recovering layouts from vertically compressed 1D sequences. However, the compression procedure confuses the semantics of different planes, yielding inferior performance with ambiguous interpretability. To address this issue, we propose to disentangle this 1D representation by pre-segmenting orthogonal (vertical and horizontal) planes from a complex scene, explicitly capturing the geometric cues for indoor layout estimation. Considering the symmetry between the floor boundary and ceiling boundary, we also design a soft-flipping fusion strategy to assist the pre-segmentation. Besides, we present a feature assembling mechanism to effectively integrate shallow and deep features with distortion distribution awareness. To compensate for the potential errors in pre-segmentation, we further leverage triple attention to reconstruct the disentangled sequences for better performance. Experiments on four popular benchmarks demonstrate our superiority over existing SoTA solutions, especially on the 3DIoU metric. The code is available at \url{https://github.com/zhijieshen-bjtu/DOPNet}.

CVJan 4, 2023Code
RecRecNet: Rectangling Rectified Wide-Angle Images by Thin-Plate Spline Model and DoF-based Curriculum Learning

Kang Liao, Lang Nie, Chunyu Lin et al.

The wide-angle lens shows appealing applications in VR technologies, but it introduces severe radial distortion into its captured image. To recover the realistic scene, previous works devote to rectifying the content of the wide-angle image. However, such a rectification solution inevitably distorts the image boundary, which changes related geometric distributions and misleads the current vision perception models. In this work, we explore constructing a win-win representation on both content and boundary by contributing a new learning model, i.e., Rectangling Rectification Network (RecRecNet). In particular, we propose a thin-plate spline (TPS) module to formulate the non-linear and non-rigid transformation for rectangling images. By learning the control points on the rectified image, our model can flexibly warp the source structure to the target domain and achieves an end-to-end unsupervised deformation. To relieve the complexity of structure approximation, we then inspire our RecRecNet to learn the gradual deformation rules with a DoF (Degree of Freedom)-based curriculum learning. By increasing the DoF in each curriculum stage, namely, from similarity transformation (4-DoF) to homography transformation (8-DoF), the network is capable of investigating more detailed deformations, offering fast convergence on the final rectangling task. Experiments show the superiority of our solution over the compared methods on both quantitative and qualitative evaluations. The code and dataset are available at https://github.com/KangLiao929/RecRecNet.

CVAug 3, 2022Code
Neural Contourlet Network for Monocular 360 Depth Estimation

Zhijie Shen, Chunyu Lin, Lang Nie et al.

For a monocular 360 image, depth estimation is a challenging because the distortion increases along the latitude. To perceive the distortion, existing methods devote to designing a deep and complex network architecture. In this paper, we provide a new perspective that constructs an interpretable and sparse representation for a 360 image. Considering the importance of the geometric structure in depth estimation, we utilize the contourlet transform to capture an explicit geometric cue in the spectral domain and integrate it with an implicit cue in the spatial domain. Specifically, we propose a neural contourlet network consisting of a convolutional neural network and a contourlet transform branch. In the encoder stage, we design a spatial-spectral fusion module to effectively fuse two types of cues. Contrary to the encoder, we employ the inverse contourlet transform with learned low-pass subbands and band-pass directional subbands to compose the depth in the decoder. Experiments on the three popular panoramic image datasets demonstrate that the proposed approach outperforms the state-of-the-art schemes with faster convergence. Code is available at https://github.com/zhijieshen-bjtu/Neural-Contourlet-Network-for-MODE.

CVJun 9, 2022Code
JNMR: Joint Non-linear Motion Regression for Video Frame Interpolation

Meiqin Liu, Chenming Xu, Chao Yao et al.

Video frame interpolation (VFI) aims to generate predictive frames by warping learnable motions from the bidirectional historical references. Most existing works utilize spatio-temporal semantic information extractor to realize motion estimation and interpolation modeling. However, they insufficiently consider the real mechanistic rationality of generated middle motions. In this paper, we reformulate VFI as a Joint Non-linear Motion Regression (JNMR) strategy to model the complicated motions of inter-frame. Specifically, the motion trajectory between the target frame and the multiple reference frames is regressed by a temporal concatenation of multi-stage quadratic models. ConvLSTM is adopted to construct this joint distribution of complete motions in temporal dimension. Moreover, the feature learning network is designed to optimize for the joint regression modeling. A coarse-to-fine synthesis enhancement module is also conducted to learn visual dynamics at different resolutions through repetitive regression and interpolation. Experimental results on VFI show that the effectiveness and significant improvement of joint motion regression compared with the state-of-the-art methods. The code is available at https://github.com/ruhig6/JNMR.

CVMar 8, 2022
Deep Rectangling for Image Stitching: A Learning Baseline

Lang Nie, Chunyu Lin, Kang Liao et al.

Stitched images provide a wide field-of-view (FoV) but suffer from unpleasant irregular boundaries. To deal with this problem, existing image rectangling methods devote to searching an initial mesh and optimizing a target mesh to form the mesh deformation in two stages. Then rectangular images can be generated by warping stitched images. However, these solutions only work for images with rich linear structures, leading to noticeable distortions for portraits and landscapes with non-linear objects. In this paper, we address these issues by proposing the first deep learning solution to image rectangling. Concretely, we predefine a rigid target mesh and only estimate an initial mesh to form the mesh deformation, contributing to a compact one-stage solution. The initial mesh is predicted using a fully convolutional network with a residual progressive regression strategy. To obtain results with high content fidelity, a comprehensive objective function is proposed to simultaneously encourage the boundary rectangular, mesh shape-preserving, and content perceptually natural. Besides, we build the first image stitching rectangling dataset with a large diversity in irregular boundaries and scenes. Experiments demonstrate our superiority over traditional methods both quantitatively and qualitatively.

CVAug 29, 2024Code
Revisiting 360 Depth Estimation with PanoGabor: A New Fusion Perspective

Zhijie Shen, Chunyu Lin, Lang Nie et al.

Depth estimation from a monocular 360 image is important to the perception of the entire 3D environment. However, the inherent distortion and large field of view (FoV) in 360 images pose great challenges for this task. To this end, existing mainstream solutions typically introduce additional perspective-based 360 representations ({e.g., Cubemap) to achieve effective feature extraction. Nevertheless, regardless of the introduced representations, they eventually need to be unified into the equirectangular projection (ERP) format for the subsequent depth estimation, which inevitably reintroduces the troublesome distortions. In this work, we propose an oriented distortion-aware Gabor Fusion framework (PGFuse) to address the above challenges. First, we introduce Gabor filters that analyze texture in the frequency domain, thereby extending the receptive fields and enhancing depth cues. To address the reintroduced distortions, we design a linear latitude-aware distortion representation method to generate customized, distortion-aware Gabor filters (PanoGabor filters). Furthermore, we design a channel-wise and spatial-wise unidirectional fusion module (CS-UFM) that integrates the proposed PanoGabor filters to unify other representations into the ERP format, delivering effective and distortion-free features. Considering the orientation sensitivity of the Gabor transform, we introduce a spherical gradient constraint to stabilize this sensitivity. Experimental results on three popular indoor 360 benchmarks demonstrate the superiority of the proposed PGFuse to existing state-of-the-art solutions. Code and models will be available at https://github.com/zhijieshen-bjtu/PGFuse

CVJun 12, 2022
FisheyeEX: Polar Outpainting for Extending the FoV of Fisheye Lens

Kang Liao, Chunyu Lin, Yunchao Wei et al.

Fisheye lens gains increasing applications in computational photography and assisted driving because of its wide field of view (FoV). However, the fisheye image generally contains invalid black regions induced by its imaging model. In this paper, we present a FisheyeEX method that extends the FoV of the fisheye lens by outpainting the invalid regions, improving the integrity of captured scenes. Compared with the rectangle and undistorted image, there are two challenges for fisheye image outpainting: irregular painting regions and distortion synthesis. Observing the radial symmetry of the fisheye image, we first propose a polar outpainting strategy to extrapolate the coherent semantics from the center to the outside region. Such an outpainting manner considers the distribution pattern of radial distortion and the circle boundary, boosting a more reasonable completion direction. For the distortion synthesis, we propose a spiral distortion-aware perception module, in which the learning path keeps consistent with the distortion prior of the fisheye image. Subsequently, a scene revision module rearranges the generated pixels with the estimated distortion to match the fisheye image, thus extending the FoV. In the experiment, we evaluate the proposed FisheyeEX on three popular outdoor datasets: Cityscapes, BDD100k, and KITTI, and one real-world fisheye image dataset. The results demonstrate that our approach significantly outperforms the state-of-the-art methods, gaining around 27% more content beyond the original fisheye image.

CVMar 17, 2022
PanoFormer: Panorama Transformer for Indoor 360 Depth Estimation

Zhijie Shen, Chunyu Lin, Kang Liao et al.

Existing panoramic depth estimation methods based on convolutional neural networks (CNNs) focus on removing panoramic distortions, failing to perceive panoramic structures efficiently due to the fixed receptive field in CNNs. This paper proposes the panorama transformer (named PanoFormer) to estimate the depth in panorama images, with tangent patches from spherical domain, learnable token flows, and panorama specific metrics. In particular, we divide patches on the spherical tangent domain into tokens to reduce the negative effect of panoramic distortions. Since the geometric structures are essential for depth estimation, a self-attention module is redesigned with an additional learnable token flow. In addition, considering the characteristic of the spherical domain, we present two panorama-specific metrics to comprehensively evaluate the panoramic depth estimation models' performance. Extensive experiments demonstrate that our approach significantly outperforms the state-of-the-art (SOTA) methods. Furthermore, the proposed method can be effectively extended to solve semantic panorama segmentation, a similar pixel2pixel task. Code will be available.

CVNov 3, 2022
Temporal Consistency Learning of inter-frames for Video Super-Resolution

Meiqin Liu, Shuo Jin, Chao Yao et al.

Video super-resolution (VSR) is a task that aims to reconstruct high-resolution (HR) frames from the low-resolution (LR) reference frame and multiple neighboring frames. The vital operation is to utilize the relative misaligned frames for the current frame reconstruction and preserve the consistency of the results. Existing methods generally explore information propagation and frame alignment to improve the performance of VSR. However, few studies focus on the temporal consistency of inter-frames. In this paper, we propose a Temporal Consistency learning Network (TCNet) for VSR in an end-to-end manner, to enhance the consistency of the reconstructed videos. A spatio-temporal stability module is designed to learn the self-alignment from inter-frames. Especially, the correlative matching is employed to exploit the spatial dependency from each frame to maintain structural stability. Moreover, a self-attention mechanism is utilized to learn the temporal correspondence to implement an adaptive warping operation for temporal consistency among multi-frames. Besides, a hybrid recurrent architecture is designed to leverage short-term and long-term information. We further present a progressive fusion module to perform a multistage fusion of spatio-temporal features. And the final reconstructed frames are refined by these fused features. Objective and subjective results of various experiments demonstrate that TCNet has superior performance on different benchmark datasets, compared to several state-of-the-art methods.

CVJul 6, 2022
Complementary Bi-directional Feature Compression for Indoor 360° Semantic Segmentation with Self-distillation

Zishuo Zheng, Chunyu Lin, Lang Nie et al.

Recently, horizontal representation-based panoramic semantic segmentation approaches outperform projection-based solutions, because the distortions can be effectively removed by compressing the spherical data in the vertical direction. However, these methods ignore the distortion distribution prior and are limited to unbalanced receptive fields, e.g., the receptive fields are sufficient in the vertical direction and insufficient in the horizontal direction. Differently, a vertical representation compressed in another direction can offer implicit distortion prior and enlarge horizontal receptive fields. In this paper, we combine the two different representations and propose a novel 360° semantic segmentation solution from a complementary perspective. Our network comprises three modules: a feature extraction module, a bi-directional compression module, and an ensemble decoding module. First, we extract multi-scale features from a panorama. Then, a bi-directional compression module is designed to compress features into two complementary low-dimensional representations, which provide content perception and distortion prior. Furthermore, to facilitate the fusion of bi-directional features, we design a unique self distillation strategy in the ensemble decoding module to enhance the interaction of different features and further improve the performance. Experimental results show that our approach outperforms the state-of-the-art solutions with at least 10\% improvement on quantitative evaluations while displaying the best performance on visual appearance.

CVMar 18, 2022
Distortion-Tolerant Monocular Depth Estimation On Omnidirectional Images Using Dual-cubemap

Zhijie Shen, Chunyu Lin, Lang Nie et al.

Estimating the depth of omnidirectional images is more challenging than that of normal field-of-view (NFoV) images because the varying distortion can significantly twist an object's shape. The existing methods suffer from troublesome distortion while estimating the depth of omnidirectional images, leading to inferior performance. To reduce the negative impact of the distortion influence, we propose a distortion-tolerant omnidirectional depth estimation algorithm using a dual-cubemap. It comprises two modules: Dual-Cubemap Depth Estimation (DCDE) module and Boundary Revision (BR) module. In DCDE module, we present a rotation-based dual-cubemap model to estimate the accurate NFoV depth, reducing the distortion at the cost of boundary discontinuity on omnidirectional depths. Then a boundary revision module is designed to smooth the discontinuous boundaries, which contributes to the precise and visually continuous omnidirectional depths. Extensive experiments demonstrate the superiority of our method over other state-of-the-art solutions.

CVFeb 20, 2023
Unsupervised OmniMVS: Efficient Omnidirectional Depth Inference via Establishing Pseudo-Stereo Supervision

Zisong Chen, Chunyu Lin, Lang Nie et al.

Omnidirectional multi-view stereo (MVS) vision is attractive for its ultra-wide field-of-view (FoV), enabling machines to perceive 360° 3D surroundings. However, the existing solutions require expensive dense depth labels for supervision, making them impractical in real-world applications. In this paper, we propose the first unsupervised omnidirectional MVS framework based on multiple fisheye images. To this end, we project all images to a virtual view center and composite two panoramic images with spherical geometry from two pairs of back-to-back fisheye images. The two 360° images formulate a stereo pair with a special pose, and the photometric consistency is leveraged to establish the unsupervised constraint, which we term "Pseudo-Stereo Supervision". In addition, we propose Un-OmniMVS, an efficient unsupervised omnidirectional MVS network, to facilitate the inference speed with two efficient components. First, a novel feature extractor with frequency attention is proposed to simultaneously capture the non-local Fourier features and local spatial features, explicitly facilitating the feature representation. Then, a variance-based light cost volume is put forward to reduce the computational complexity. Experiments exhibit that the performance of our unsupervised solution is competitive to that of the state-of-the-art (SoTA) supervised methods with better generalization in real-world data.

CVJan 26, 2023
Dual Diffusion Architecture for Fisheye Image Rectification: Synthetic-to-Real Generalization

Shangrong Yang, Chunyu Lin, Kang Liao et al.

Fisheye image rectification has a long-term unresolved issue with synthetic-to-real generalization. In most previous works, the model trained on the synthetic images obtains unsatisfactory performance on the real-world fisheye image. To this end, we propose a Dual Diffusion Architecture (DDA) for the fisheye rectification with a better generalization ability. The proposed DDA is simultaneously trained with paired synthetic fisheye images and unlabeled real fisheye images. By gradually introducing noises, the synthetic and real fisheye images can eventually develop into a consistent noise distribution, improving the generalization and achieving unlabeled real fisheye correction. The original image serves as the prior guidance in existing DDPMs (Denoising Diffusion Probabilistic Models). However, the non-negligible indeterminate relationship between the prior condition and the target affects the generation performance. Especially in the rectification task, the radial distortion can cause significant artifacts. Therefore, we provide an unsupervised one-pass network that produces a plausible new condition to strengthen guidance. This network can be regarded as an alternate scheme for fast producing reliable results without iterative inference. Compared with the state-of-the-art methods, our approach can reach superior performance in both synthetic and real fisheye image corrections.

CVJul 5, 2022
FishFormer: Annulus Slicing-based Transformer for Fisheye Rectification with Efficacy Domain Exploration

Shangrong Yang, Chunyu Lin, Kang Liao et al.

Numerous significant progress on fisheye image rectification has been achieved through CNN. Nevertheless, constrained by a fixed receptive field, the global distribution and the local symmetry of the distortion have not been fully exploited. To leverage these two characteristics, we introduced Fishformer that processes the fisheye image as a sequence to enhance global and local perception. We tuned the Transformer according to the structural properties of fisheye images. First, the uneven distortion distribution in patches generated by the existing square slicing method confuses the network, resulting in difficult training. Therefore, we propose an annulus slicing method to maintain the consistency of the distortion in each patch, thus perceiving the distortion distribution well. Second, we analyze that different distortion parameters have their own efficacy domains. Hence, the perception of the local area is as important as the global, but Transformer has a weakness for local texture perception. Therefore, we propose a novel layer attention mechanism to enhance the local perception and texture transfer. Our network simultaneously implements global perception and focused local perception decided by the different parameters. Extensive experiments demonstrate that our method provides superior performance compared with state-of-the-art methods.

CVMar 3
Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing

Jiyuan Wang, Chunyu Lin, Lei Sun et al.

Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturally positioning reinforcement learning (RL) as a feasible solution. Motivated by this, we propose \textbf{RL3DEdit}, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model, VGGT. Specifically, we leverage VGGT's robust priors learned from massive real-world data, feed the edited images, and utilize the output confidence maps and pose estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.

IVMar 16, 2023
SigVIC: Spatial Importance Guided Variable-Rate Image Compression

Jiaming Liang, Meiqin Liu, Chao Yao et al.

Variable-rate mechanism has improved the flexibility and efficiency of learning-based image compression that trains multiple models for different rate-distortion tradeoffs. One of the most common approaches for variable-rate is to channel-wisely or spatial-uniformly scale the internal features. However, the diversity of spatial importance is instructive for bit allocation of image compression. In this paper, we introduce a Spatial Importance Guided Variable-rate Image Compression (SigVIC), in which a spatial gating unit (SGU) is designed for adaptively learning a spatial importance mask. Then, a spatial scaling network (SSN) takes the spatial importance mask to guide the feature scaling and bit allocation for variable-rate. Moreover, to improve the quality of decoded image, Top-K shallow features are selected to refine the decoded features through a shallow feature fusion module (SFFM). Experiments show that our method outperforms other learning-based methods (whether variable-rate or not) and traditional codecs, with storage saving and high flexibility.

57.1CVApr 14
StructDiff: A Structure-Preserving and Spatially Controllable Diffusion Model for Single-Image Generation

Yinxi He, Kang Liao, Chunyu Lin et al.

This paper introduces StructDiff, a generative framework based on a single-scale diffusion model for single-image generation. Single-image generation aims to synthesize diverse samples with similar visual content to the source image by capturing its internal statistics, without relying on external data. However, existing methods often struggle to preserve the structural layout, especially for images with large rigid objects or strict spatial constraints. Moreover, most approaches lack spatial controllability, making it difficult to guide the structure or placement of generated content. To address these challenges, StructDiff introduces an \textit{adaptive receptive field} module to maintain both global and local distributions. Building on this foundation, StructDiff incorporates 3D positional encoding (PE) as a spatial prior, allowing flexible control over positions, scale, and local details of generated objects. To our knowledge, this spatial control capability represents the first exploration of PE-based manipulation in single-image generation. Furthermore, we propose a novel evaluation criterion for single-image generation based on large language models (LLMs). This criterion specifically addresses the limitations of existing objective metrics and the high labor costs associated with user studies. StructDiff also demonstrates broad applicability across downstream tasks, such as text-guided image generation, image editing, outpainting, and paint-to-image synthesis. Extensive experiments demonstrate that StructDiff outperforms existing methods in structural consistency, visual quality, and spatial controllability. The project page is available at https://butter-crab.github.io/StructDiff/.

CVAug 28, 2024
Str-L Pose: Integrating Point and Structured Line for Relative Pose Estimation in Dual-Graph

Zherong Zhang, Chunyu Lin, Shujuan Huang et al.

Relative pose estimation is crucial for various computer vision applications, including Robotic and Autonomous Driving. Current methods primarily depend on selecting and matching feature points prone to incorrect matches, leading to poor performance. Consequently, relying solely on point-matching relationships for pose estimation is a huge challenge. To overcome these limitations, we propose a Geometric Correspondence Graph neural network that integrates point features with extra structured line segments. This integration of matched points and line segments further exploits the geometry constraints and enhances model performance across different environments. We employ the Dual-Graph module and Feature Weighted Fusion Module to aggregate geometric and visual features effectively, facilitating complex scene understanding. We demonstrate our approach through extensive experiments on the DeMoN and KITTI Odometry datasets. The results show that our method is competitive with state-of-the-art techniques.

CVFeb 8, 2023
Spatiotemporal Deformation Perception for Fisheye Video Rectification

Shangrong Yang, Chunyu Lin, Kang Liao et al.

Although the distortion correction of fisheye images has been extensively studied, the correction of fisheye videos is still an elusive challenge. For different frames of the fisheye video, the existing image correction methods ignore the correlation of sequences, resulting in temporal jitter in the corrected video. To solve this problem, we propose a temporal weighting scheme to get a plausible global optical flow, which mitigates the jitter effect by progressively reducing the weight of frames. Subsequently, we observe that the inter-frame optical flow of the video is facilitated to perceive the local spatial deformation of the fisheye video. Therefore, we derive the spatial deformation through the flows of fisheye and distorted-free videos, thereby enhancing the local accuracy of the predicted result. However, the independent correction for each frame disrupts the temporal correlation. Due to the property of fisheye video, a distorted moving object may be able to find its distorted-free pattern at another moment. To this end, a temporal deformation aggregator is designed to reconstruct the deformation correlation between frames and provide a reliable global feature. Our method achieves an end-to-end correction and demonstrates superiority in correction quality and stability compared with the SOTA correction methods.

CVApr 12, 2022
Towards Reliable Image Outpainting: Learning Structure-Aware Multimodal Fusion with Depth Guidance

Lei Zhang, Kang Liao, Chunyu Lin et al.

Image outpainting technology generates visually plausible content regardless of authenticity, making it unreliable to be applied in practice. Thus, we propose a reliable image outpainting task, introducing the sparse depth from LiDARs to extrapolate authentic RGB scenes. The large field view of LiDARs allows it to serve for data enhancement and further multimodal tasks. Concretely, we propose a Depth-Guided Outpainting Network to model different feature representations of two modalities and learn the structure-aware cross-modal fusion. And two components are designed: 1) The Multimodal Learning Module produces unique depth and RGB feature representations from the perspectives of different modal characteristics. 2) The Depth Guidance Fusion Module leverages the complete depth modality to guide the establishment of RGB contents by progressive multimodal feature fusion. Furthermore, we specially design an additional constraint strategy consisting of Cross-modal Loss and Edge Loss to enhance ambiguous contours and expedite reliable content generation. Extensive experiments on KITTI and Waymo datasets demonstrate our superiority over the state-of-the-art method, quantitatively and qualitatively.

CVMar 11, 2024Code
Eliminating Warping Shakes for Unsupervised Online Video Stitching

Lang Nie, Chunyu Lin, Kang Liao et al.

In this paper, we retarget video stitching to an emerging issue, named warping shake, when extending image stitching to video stitching. It unveils the temporal instability of warped content in non-overlapping regions, despite image stitching having endeavored to preserve the natural structures. Therefore, in most cases, even if the input videos to be stitched are stable, the stitched video will inevitably cause undesired warping shakes and affect the visual experience. To eliminate the shakes, we propose StabStitch to simultaneously realize video stitching and video stabilization in a unified unsupervised learning framework. Starting from the camera paths in video stabilization, we first derive the expression of stitching trajectories in video stitching by elaborately integrating spatial and temporal warps. Then a warp smoothing model is presented to optimize them with a comprehensive consideration regarding content alignment, trajectory smoothness, spatial consistency, and online collaboration. To establish an evaluation benchmark and train the learning framework, we build a video stitching dataset with a rich diversity in camera motions and scenes. Compared with existing stitching solutions, StabStitch exhibits significant superiority in scene robustness and inference speed in addition to stitching and stabilization performance, contributing to a robust and real-time online video stitching system. The code and dataset are available at https://github.com/nie-lang/StabStitch.

CVApr 15, 2024Code
Digging into contrastive learning for robust depth estimation with diffusion models

Jiyuan Wang, Chunyu Lin, Lang Nie et al.

Recently, diffusion-based depth estimation methods have drawn widespread attention due to their elegant denoising patterns and promising performance. However, they are typically unreliable under adverse conditions prevalent in real-world scenarios, such as rainy, snowy, etc. In this paper, we propose a novel robust depth estimation method called D4RD, featuring a custom contrastive learning mode tailored for diffusion models to mitigate performance degradation in complex environments. Concretely, we integrate the strength of knowledge distillation into contrastive learning, building the `trinity' contrastive scheme. This scheme utilizes the sampled noise of the forward diffusion process as a natural reference, guiding the predicted noise in diverse scenes toward a more stable and precise optimum. Moreover, we extend noise-level trinity to encompass more generic feature and image levels, establishing a multi-level contrast to distribute the burden of robust perception across the overall network. Before addressing complex scenarios, we enhance the stability of the baseline diffusion model with three straightforward yet effective improvements, which facilitate convergence and remove depth outliers. Extensive experiments demonstrate that D4RD surpasses existing state-of-the-art solutions on synthetic corruption datasets and real-world weather conditions. Source code and data are available at \url{https://github.com/wangjiyuan9/D4RD}.

CVDec 26, 2023Code
360 Layout Estimation via Orthogonal Planes Disentanglement and Multi-view Geometric Consistency Perception

Zhijie Shen, Chunyu Lin, Junsong Zhang et al.

Existing panoramic layout estimation solutions tend to recover room boundaries from a vertically compressed sequence, yielding imprecise results as the compression process often muddles the semantics between various planes. Besides, these data-driven approaches impose an urgent demand for massive data annotations, which are laborious and time-consuming. For the first problem, we propose an orthogonal plane disentanglement network (termed DOPNet) to distinguish ambiguous semantics. DOPNet consists of three modules that are integrated to deliver distortion-free, semantics-clean, and detail-sharp disentangled representations, which benefit the subsequent layout recovery. For the second problem, we present an unsupervised adaptation technique tailored for horizon-depth and ratio representations. Concretely, we introduce an optimization strategy for decision-level layout analysis and a 1D cost volume construction method for feature-level multi-view aggregation, both of which are designed to fully exploit the geometric consistency across multiple perspectives. The optimizer provides a reliable set of pseudo-labels for network training, while the 1D cost volume enriches each view with comprehensive scene information derived from other perspectives. Extensive experiments demonstrate that our solution outperforms other SoTA models on both monocular layout estimation and multi-view layout estimation tasks. Cobe can be available at https://github.com/zhijieshen-bjtu/MV-DOPNet.

95.2CVMay 12
CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

Jiyuan Wang, Huan Ouyang, Jiuzhou Lin et al.

In this paper, we propose Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model based on Vision-Language Models. During inference, it first conducts a global temporal scan to anchor anomalous time windows, then performs fine-grained spatial grounding within the localized interval, and finally derives robust judgments via structured spatiotemporal Chain-of-Thought reasoning. To equip the model with these capabilities, we construct the first large-scale generated video anomaly dataset with per-frame bounding-box annotations, temporal anomaly windows, and fine-grained attribution labels. Building on this dataset, we design a three-stage progressive training paradigm. The model initially learns spatial and temporal anchoring through single- and multi-frame supervised fine-tuning, and then is optimized by a reinforcement learning strategy based on two-turn Group Relative Policy Optimization (GRPO). Beyond conventional accuracy rewards, we introduce Temporal and Spatial IoU rewards to supervise the intermediate localization process, effectively guiding the model toward more grounded and interpretable spatiotemporal reasoning. Extensive experiments demonstrate that CaC can stably concentrate on subtle anomalies, achieving a 25.7% accuracy improvement on fine-grained anomaly benchmarks and, when used as a reward signal, CaC reduces generated-video anomalies by 11.7% while improving overall video quality.

CVSep 16, 2025Code
Advancing Real-World Parking Slot Detection with Large-Scale Dataset and Semi-Supervised Baseline

Zhihao Zhang, Chunyu Lin, Lang Nie et al.

As automatic parking systems evolve, the accurate detection of parking slots has become increasingly critical. This study focuses on parking slot detection using surround-view cameras, which offer a comprehensive bird's-eye view of the parking environment. However, the current datasets are limited in scale, and the scenes they contain are seldom disrupted by real-world noise (e.g., light, occlusion, etc.). Moreover, manual data annotation is prone to errors and omissions due to the complexity of real-world conditions, significantly increasing the cost of annotating large-scale datasets. To address these issues, we first construct a large-scale parking slot detection dataset (named CRPS-D), which includes various lighting distributions, diverse weather conditions, and challenging parking slot variants. Compared with existing datasets, the proposed dataset boasts the largest data scale and consists of a higher density of parking slots, particularly featuring more slanted parking slots. Additionally, we develop a semi-supervised baseline for parking slot detection, termed SS-PSD, to further improve performance by exploiting unlabeled data. To our knowledge, this is the first semi-supervised approach in parking slot detection, which is built on the teacher-student model with confidence-guided mask consistency and adaptive feature perturbation. Experimental results demonstrate the superiority of SS-PSD over the existing state-of-the-art (SoTA) solutions on both the proposed dataset and the existing dataset. Particularly, the more unlabeled data there is, the more significant the gains brought by our semi-supervised scheme. The relevant source codes and the dataset have been made publicly available at https://github.com/zzh362/CRPS-D.

CVAug 7, 2025Code
Robust Image Stitching with Optimal Plane

Lang Nie, Yuan Mei, Kang Liao et al.

We present \textit{RopStitch}, an unsupervised deep image stitching framework with both robustness and naturalness. To ensure the robustness of \textit{RopStitch}, we propose to incorporate the universal prior of content perception into the image stitching model by a dual-branch architecture. It separately captures coarse and fine features and integrates them to achieve highly generalizable performance across diverse unseen real-world scenes. Concretely, the dual-branch model consists of a pretrained branch to capture semantically invariant representations and a learnable branch to extract fine-grained discriminative features, which are then merged into a whole by a controllable factor at the correlation level. Besides, considering that content alignment and structural preservation are often contradictory to each other, we propose a concept of virtual optimal planes to relieve this conflict. To this end, we model this problem as a process of estimating homography decomposition coefficients, and design an iterative coefficient predictor and minimal semantic distortion constraint to identify the optimal plane. This scheme is finally incorporated into \textit{RopStitch} by warping both views onto the optimal plane bidirectionally. Extensive experiments across various datasets demonstrate that \textit{RopStitch} significantly outperforms existing methods, particularly in scene robustness and content naturalness. The code is available at {\color{red}https://github.com/MmelodYy/RopStitch}.

CVApr 12, 2025Code
You Need a Transition Plane: Bridging Continuous Panoramic 3D Reconstruction with Perspective Gaussian Splatting

Zhijie Shen, Chunyu Lin, Shujuan Huang et al.

Recently, reconstructing scenes from a single panoramic image using advanced 3D Gaussian Splatting (3DGS) techniques has attracted growing interest. Panoramic images offer a 360$\times$ 180 field of view (FoV), capturing the entire scene in a single shot. However, panoramic images introduce severe distortion, making it challenging to render 3D Gaussians into 2D distorted equirectangular space directly. Converting equirectangular images to cubemap projections partially alleviates this problem but introduces new challenges, such as projection distortion and discontinuities across cube-face boundaries. To address these limitations, we present a novel framework, named TPGS, to bridge continuous panoramic 3D scene reconstruction with perspective Gaussian splatting. Firstly, we introduce a Transition Plane between adjacent cube faces to enable smoother transitions in splatting directions and mitigate optimization ambiguity in the boundary region. Moreover, an intra-to-inter face optimization strategy is proposed to enhance local details and restore visual consistency across cube-face boundaries. Specifically, we optimize 3D Gaussians within individual cube faces and then fine-tune them in the stitched panoramic space. Additionally, we introduce a spherical sampling technique to eliminate visible stitching seams. Extensive experiments on indoor and outdoor, egocentric, and roaming benchmark datasets demonstrate that our approach outperforms existing state-of-the-art methods. Code and models will be available at https://github.com/zhijieshen-bjtu/TPGS.

CVJan 24, 2024Code
Semi-Supervised Coupled Thin-Plate Spline Model for Rotation Correction and Beyond

Lang Nie, Chunyu Lin, Kang Liao et al.

Thin-plate spline (TPS) is a principal warp that allows for representing elastic, nonlinear transformation with control point motions. With the increase of control points, the warp becomes increasingly flexible but usually encounters a bottleneck caused by undesired issues, e.g., content distortion. In this paper, we explore generic applications of TPS in single-image-based warping tasks, such as rotation correction, rectangling, and portrait correction. To break this bottleneck, we propose the coupled thin-plate spline model (CoupledTPS), which iteratively couples multiple TPS with limited control points into a more flexible and powerful transformation. Concretely, we first design an iterative search to predict new control points according to the current latent condition. Then, we present the warping flow as a bridge for the coupling of different TPS transformations, effectively eliminating interpolation errors caused by multiple warps. Besides, in light of the laborious annotation cost, we develop a semi-supervised learning scheme to improve warping quality by exploiting unlabeled data. It is formulated through dual transformation between the searched control points of unlabeled data and its graphic augmentation, yielding an implicit correction consistency constraint. Finally, we collect massive unlabeled data to exhibit the benefit of our semi-supervised scheme in rotation correction. Extensive experiments demonstrate the superiority and universality of CoupledTPS over the existing state-of-the-art (SoTA) solutions for rotation correction and beyond. The code and data are available at https://github.com/nie-lang/CoupledTPS.

CVJul 6, 2021Code
Depth-Aware Multi-Grid Deep Homography Estimation with Contextual Correlation

Lang Nie, Chunyu Lin, Kang Liao et al.

Homography estimation is an important task in computer vision applications, such as image stitching, video stabilization, and camera calibration. Traditional homography estimation methods heavily depend on the quantity and distribution of feature correspondences, leading to poor robustness in low-texture scenes. The learning solutions, on the contrary, try to learn robust deep features but demonstrate unsatisfying performance in the scenes with low overlap rates. In this paper, we address these two problems simultaneously by designing a contextual correlation layer (CCL). The CCL can efficiently capture the long-range correlation within feature maps and can be flexibly used in a learning framework. In addition, considering that a single homography can not represent the complex spatial transformation in depth-varying images with parallax, we propose to predict multi-grid homography from global to local. Moreover, we equip our network with a depth perception capability, by introducing a novel depth-aware shape-preserved loss. Extensive experiments demonstrate the superiority of our method over state-of-the-art solutions in the synthetic benchmark dataset and real-world dataset. The codes and models will be available at https://github.com/nie-lang/Multi-Grid-Deep-Homography.

CVJul 21, 2020Code
A Deep Ordinal Distortion Estimation Approach for Distortion Rectification

Kang Liao, Chunyu Lin, Yao Zhao

Distortion is widely existed in the images captured by popular wide-angle cameras and fisheye cameras. Despite the long history of distortion rectification, accurately estimating the distortion parameters from a single distorted image is still challenging. The main reason is these parameters are implicit to image features, influencing the networks to fully learn the distortion information. In this work, we propose a novel distortion rectification approach that can obtain more accurate parameters with higher efficiency. Our key insight is that distortion rectification can be cast as a problem of learning an ordinal distortion from a single distorted image. To solve this problem, we design a local-global associated estimation network that learns the ordinal distortion to approximate the realistic distortion distribution. In contrast to the implicit distortion parameters, the proposed ordinal distortion have more explicit relationship with image features, and thus significantly boosts the distortion perception of neural networks. Considering the redundancy of distortion information, our approach only uses a part of distorted image for the ordinal distortion estimation, showing promising applications in the efficient distortion rectification. To our knowledge, we first unify the heterogeneous distortion parameters into a learning-friendly intermediate representation through ordinal distortion, bridging the gap between image feature and distortion rectification. The experimental results demonstrate that our approach outperforms the state-of-the-art methods by a significant margin, with approximately 23% improvement on the quantitative evaluation while displaying the best performance on visual appearance. The code is available at https://github.com/KangLiao929/OrdinalDistortion.

63.8CVMar 11
UniStitch: Unifying Semantic and Geometric Features for Image Stitching

Yuan Mei, Lang Nie, Kang Liao et al.

Traditional image stitching methods estimate warps from hand-crafted geometric features, whereas recent learning-based solutions leverage semantic features from neural networks instead. These two lines of research have largely diverged along separate evolution, with virtually no meaningful convergence to date. In this paper, we take a pioneering step to bridge this gap by unifying semantic and geometric features with UniStitch, a unified image stitching framework from multimodal features. To align discrete geometric features (i.e., keypoint) with continuous semantic feature maps, we present a Neural Point Transformer (NPT) module, which transforms unordered, sparse 1D geometric keypoints into ordered, dense 2D semantic maps. Then, to integrate the advantages of both representations, an Adaptive Mixture of Experts (AMoE) module is designed to fuse geometric and semantic representations. It dynamically shifts focus toward more reliable features during the fusion process, allowing the model to handle complex scenes, especially when either modality might be compromised. The fused representation can be adopted into common deep stitching pipelines, delivering significant performance gains over any single feature. Experiments show that UniStitch outperforms existing state-of-the-art methods with a large margin, paving the way for a unified paradigm between traditional and learning-based image stitching.

CVMar 6
RePer-360: Releasing Perspective Priors for 360$^\circ$ Depth Estimation via Self-Modulation

Cheng Guan, Chunyu Lin, Zhijie Shen et al.

Recent depth foundation models trained on perspective imagery achieve strong performance, yet generalize poorly to 360$^\circ$ images due to the substantial geometric discrepancy between perspective and panoramic domains. Moreover, fully fine-tuning these models typically requires large amounts of panoramic data. To address this issue, we propose RePer-360, a distortion-aware self-modulation framework for monocular panoramic depth estimation that adapts depth foundation models while preserving powerful pretrained perspective priors. Specifically, we design a lightweight geometry-aligned guidance module to derive a modulation signal from two complementary projections (i.e., ERP and CP) and use it to guide the model toward the panoramic domain without overwriting its pretrained perspective knowledge. We further introduce a Self-Conditioned AdaLN-Zero mechanism that produces pixel-wise scaling factors to reduce the feature distribution gap between the perspective and panoramic domains. In addition, a cubemap-domain consistency loss further improves training stability and cross-projection alignment. By shifting the focus from complementary-projection fusion to panoramic domain adaptation under preserved pretrained perspective priors, RePer-360 surpasses standard fine-tuning methods while using only 1\% of the training data. Under the same in-domain training setting, it further achieves an approximately 20\% improvement in RMSE. Code will be released upon acceptance.

CVMar 20, 2025
Jasmine: Harnessing Diffusion Prior for Self-supervised Depth Estimation

Jiyuan Wang, Chunyu Lin, Cheng Guan et al.

In this paper, we propose Jasmine, the first Stable Diffusion (SD)-based self-supervised framework for monocular depth estimation, which effectively harnesses SD's visual priors to enhance the sharpness and generalization of unsupervised prediction. Previous SD-based methods are all supervised since adapting diffusion models for dense prediction requires high-precision supervision. In contrast, self-supervised reprojection suffers from inherent challenges (e.g., occlusions, texture-less regions, illumination variance), and the predictions exhibit blurs and artifacts that severely compromise SD's latent priors. To resolve this, we construct a novel surrogate task of hybrid image reconstruction. Without any additional supervision, it preserves the detail priors of SD models by reconstructing the images themselves while preventing depth estimation from degradation. Furthermore, to address the inherent misalignment between SD's scale and shift invariant estimation and self-supervised scale-invariant depth estimation, we build the Scale-Shift GRU. It not only bridges this distribution gap but also isolates the fine-grained texture of SD output against the interference of reprojection loss. Extensive experiments demonstrate that Jasmine achieves SoTA performance on the KITTI benchmark and exhibits superior zero-shot generalization across multiple datasets.

CVMay 8, 2025
StabStitch++: Unsupervised Online Video Stitching with Spatiotemporal Bidirectional Warps

Lang Nie, Chunyu Lin, Kang Liao et al.

We retarget video stitching to an emerging issue, named warping shake, which unveils the temporal content shakes induced by sequentially unsmooth warps when extending image stitching to video stitching. Even if the input videos are stable, the stitched video can inevitably cause undesired warping shakes and affect the visual experience. To address this issue, we propose StabStitch++, a novel video stitching framework to realize spatial stitching and temporal stabilization with unsupervised learning simultaneously. First, different from existing learning-based image stitching solutions that typically warp one image to align with another, we suppose a virtual midplane between original image planes and project them onto it. Concretely, we design a differentiable bidirectional decomposition module to disentangle the homography transformation and incorporate it into our spatial warp, evenly spreading alignment burdens and projective distortions across two views. Then, inspired by camera paths in video stabilization, we derive the mathematical expression of stitching trajectories in video stitching by elaborately integrating spatial and temporal warps. Finally, a warp smoothing model is presented to produce stable stitched videos with a hybrid loss to simultaneously encourage content alignment, trajectory smoothness, and online collaboration. Compared with StabStitch that sacrifices alignment for stabilization, StabStitch++ makes no compromise and optimizes both of them simultaneously, especially in the online mode. To establish an evaluation benchmark and train the learning framework, we build a video stitching dataset with a rich diversity in camera motions and scenes. Experiments exhibit that StabStitch++ surpasses current solutions in stitching performance, robustness, and efficiency, offering compelling advancements in this field by building a real-time online video stitching system.

CVApr 23, 2024
SGFormer: Spherical Geometry Transformer for 360 Depth Estimation

Junsong Zhang, Zisong Chen, Chunyu Lin et al.

Panoramic distortion poses a significant challenge in 360 depth estimation, particularly pronounced at the north and south poles. Existing methods either adopt a bi-projection fusion strategy to remove distortions or model long-range dependencies to capture global structures, which can result in either unclear structure or insufficient local perception. In this paper, we propose a spherical geometry transformer, named SGFormer, to address the above issues, with an innovative step to integrate spherical geometric priors into vision transformers. To this end, we retarget the transformer decoder to a spherical prior decoder (termed SPDecoder), which endeavors to uphold the integrity of spherical structures during decoding. Concretely, we leverage bipolar re-projection, circular rotation, and curve local embedding to preserve the spherical characteristics of equidistortion, continuity, and surface distance, respectively. Furthermore, we present a query-based global conditional position embedding to compensate for spatial structure at varying resolutions. It not only boosts the global perception of spatial position but also sharpens the depth structure across different patches. Finally, we conduct extensive experiments on popular benchmarks, demonstrating our superiority over state-of-the-art solutions.

CVSep 4, 2025
From Editor to Dense Geometry Estimator

JiYuan Wang, Chunyu Lin, Lei Sun et al.

Leveraging visual priors from pre-trained text-to-image (T2I) generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning. Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by ``refining" their innate features, and ultimately achieve higher performance than their generative counterparts. Based on these findings, we introduce \textbf{FE2E}, a framework that pioneeringly adapts an advanced editing model based on Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor's original flow matching loss into the ``consistent velocity" training objective. And we use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high precision demand of our tasks. Additionally, we leverage the DiT's global attention for a cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to mutually enhance each other. Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35\% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100$\times$ data. The project page can be accessed \href{https://amap-ml.github.io/FE2E/}{here}.

CVApr 1, 2025
Beyond Wide-Angle Images: Structure-to-Detail Video Portrait Correction via Unsupervised Spatiotemporal Adaptation

Wenbo Nie, Lang Nie, Chunyu Lin et al.

Wide-angle cameras, despite their popularity for content creation, suffer from distortion-induced facial stretching-especially at the edge of the lens-which degrades visual appeal. To address this issue, we propose a structure-to-detail portrait correction model named ImagePC. It integrates the long-range awareness of the transformer and multi-step denoising of diffusion models into a unified framework, achieving global structural robustness and local detail refinement. Besides, considering the high cost of obtaining video labels, we then repurpose ImagePC for unlabeled wide-angle videos (termed VideoPC), by spatiotemporal diffusion adaption with spatial consistency and temporal smoothness constraints. For the former, we encourage the denoised image to approximate pseudo labels following the wide-angle distortion distribution pattern, while for the latter, we derive rectification trajectories with backward optical flows and smooth them. Compared with ImagePC, VideoPC maintains high-quality facial corrections in space and mitigates the potential temporal shakes sequentially in blind scenarios. Finally, to establish an evaluation benchmark and train the framework, we establish a video portrait dataset with a large diversity in the number of people, lighting conditions, and background. Experiments demonstrate that the proposed methods outperform existing solutions quantitatively and qualitatively, contributing to high-fidelity wide-angle videos with stable and natural portraits. The codes and dataset will be available.

CVMar 3, 2025
Semi-Supervised 360 Layout Estimation with Panoramic Collaborative Perturbations

Junsong Zhang, Chunyu Lin, Zhijie Shen et al.

The performance of existing supervised layout estimation methods heavily relies on the quality of data annotations. However, obtaining large-scale and high-quality datasets remains a laborious and time-consuming challenge. To solve this problem, semi-supervised approaches are introduced to relieve the demand for expensive data annotations by encouraging the consistent results of unlabeled data with different perturbations. However, existing solutions merely employ vanilla perturbations, ignoring the characteristics of panoramic layout estimation. In contrast, we propose a novel semi-supervised method named SemiLayout360, which incorporates the priors of the panoramic layout and distortion through collaborative perturbations. Specifically, we leverage the panoramic layout prior to enhance the model's focus on potential layout boundaries. Meanwhile, we introduce the panoramic distortion prior to strengthen distortion awareness. Furthermore, to prevent intense perturbations from hindering model convergence and ensure the effectiveness of prior-based perturbations, we divide and reorganize them as panoramic collaborative perturbations. Our experimental results on three mainstream benchmarks demonstrate that the proposed method offers significant advantages over existing state-of-the-art (SoTA) solutions.

CVJan 28, 2025
What Really Matters for Learning-based LiDAR-Camera Calibration

Shujuan Huang, Chunyu Lin, Yao Zhao

Calibration is an essential prerequisite for the accurate data fusion of LiDAR and camera sensors. Traditional calibration techniques often require specific targets or suitable scenes to obtain reliable 2D-3D correspondences. To tackle the challenge of target-less and online calibration, deep neural networks have been introduced to solve the problem in a data-driven manner. While previous learning-based methods have achieved impressive performance on specific datasets, they still struggle in complex real-world scenarios. Most existing works focus on improving calibration accuracy but overlook the underlying mechanisms. In this paper, we revisit the development of learning-based LiDAR-Camera calibration and encourage the community to pay more attention to the underlying principles to advance practical applications. We systematically analyze the paradigm of mainstream learning-based methods, and identify the critical limitations of regression-based methods with the widely used data generation pipeline. Our findings reveal that most learning-based methods inadvertently operate as retrieval networks, focusing more on single-modality distributions rather than cross-modality correspondences. We also investigate how the input data format and preprocessing operations impact network performance and summarize the regression clues to inform further improvements.

CVDec 26, 2024
Revisiting Monocular 3D Object Detection with Depth Thickness Field

Qiude Zhang, Chunyu Lin, Zhijie Shen et al.

Monocular 3D object detection is challenging due to the lack of accurate depth. However, existing depth-assisted solutions still exhibit inferior performance, whose reason is universally acknowledged as the unsatisfactory accuracy of monocular depth estimation models. In this paper, we revisit monocular 3D object detection from the depth perspective and formulate an additional issue as the limited 3D structure-aware capability of existing depth representations (e.g., depth one-hot encoding or depth distribution). To address this issue, we introduce a novel Depth Thickness Field approach to embed clear 3D structures of the scenes. Specifically, we present MonoDTF, a scene-to-instance depth-adapted network for monocular 3D object detection. The framework mainly comprises a Scene-Level Depth Retargeting (SDR) module and an Instance-Level Spatial Refinement (ISR) module. The former retargets traditional depth representations to the proposed depth thickness field, incorporating the scene-level perception of 3D structures. The latter refines the voxel space with the guidance of instances, enhancing the 3D instance-aware capability of the depth thickness field and thus improving detection accuracy. Extensive experiments on the KITTI and Waymo datasets demonstrate our superiority to existing state-of-the-art (SoTA) methods and the universality when equipped with different depth estimation models. The code will be available.

CVDec 8, 2021
FPPN: Future Pseudo-LiDAR Frame Prediction for Autonomous Driving

Xudong Huang, Chunyu Lin, Haojie Liu et al.

LiDAR sensors are widely used in autonomous driving due to the reliable 3D spatial information. However, the data of LiDAR is sparse and the frequency of LiDAR is lower than that of cameras. To generate denser point clouds spatially and temporally, we propose the first future pseudo-LiDAR frame prediction network. Given the consecutive sparse depth maps and RGB images, we first predict a future dense depth map based on dynamic motion information coarsely. To eliminate the errors of optical flow estimation, an inter-frame aggregation module is proposed to fuse the warped depth maps with adaptive weights. Then, we refine the predicted dense depth map using static contextual information. The future pseudo-LiDAR frame can be obtained by converting the predicted dense depth map into corresponding 3D point clouds. Experimental results show that our method outperforms the existing solutions on the popular KITTI benchmark.

CVJun 24, 2021
Unsupervised Deep Image Stitching: Reconstructing Stitched Features to Images

Lang Nie, Chunyu Lin, Kang Liao et al.

Traditional feature-based image stitching technologies rely heavily on feature detection quality, often failing to stitch images with few features or low resolution. The learning-based image stitching solutions are rarely studied due to the lack of labeled data, making the supervised methods unreliable. To address the above limitations, we propose an unsupervised deep image stitching framework consisting of two stages: unsupervised coarse image alignment and unsupervised image reconstruction. In the first stage, we design an ablation-based loss to constrain an unsupervised homography network, which is more suitable for large-baseline scenes. Moreover, a transformer layer is introduced to warp the input images in the stitching-domain space. In the second stage, motivated by the insight that the misalignments in pixel-level can be eliminated to a certain extent in feature-level, we design an unsupervised image reconstruction network to eliminate the artifacts from features to pixels. Specifically, the reconstruction network can be implemented by a low-resolution deformation branch and a high-resolution refined branch, learning the deformation rules of image stitching and enhancing the resolution simultaneously. To establish an evaluation benchmark and train the learning framework, a comprehensive real-world image dataset for unsupervised deep image stitching is presented and released. Extensive experiments well demonstrate the superiority of our method over other state-of-the-art solutions. Even compared with the supervised solutions, our image stitching quality is still preferred by users.

CVApr 13, 2021
Towards Fast and Accurate Real-World Depth Super-Resolution: Benchmark Dataset and Baseline

Lingzhi He, Hongguang Zhu, Feng Li et al.

Depth maps obtained by commercial depth sensors are always in low-resolution, making it difficult to be used in various computer vision tasks. Thus, depth map super-resolution (SR) is a practical and valuable task, which upscales the depth map into high-resolution (HR) space. However, limited by the lack of real-world paired low-resolution (LR) and HR depth maps, most existing methods use downsampling to obtain paired training samples. To this end, we first construct a large-scale dataset named "RGB-D-D", which can greatly promote the study of depth map SR and even more depth-related real-world tasks. The "D-D" in our dataset represents the paired LR and HR depth maps captured from mobile phone and Lucid Helios respectively ranging from indoor scenes to challenging outdoor scenes. Besides, we provide a fast depth map super-resolution (FDSR) baseline, in which the high-frequency component adaptively decomposed from RGB image to guide the depth map SR. Extensive experiments on existing public datasets demonstrate the effectiveness and efficiency of our network compared with the state-of-the-art methods. Moreover, for the real-world LR depth maps, our algorithm can produce more accurate HR depth maps with clearer boundaries and to some extent correct the depth value errors.