CV · Nov 10, 2022
Learning Cross-view Geo-localization Embeddings via Dynamic Weighted Decorrelation Regularization
Tingyu Wang, Zhedong Zheng, Zunjie Zhu et al.
Cross-view geo-localization aims to spot images of the same location shot from two platforms, e.g., the drone platform and the satellite platform. Existing methods usually focus on optimizing the distance between one embedding and the others in the feature space, while neglecting the redundancy of the embedding itself. In this paper, we argue that low redundancy is also important, since it motivates the model to mine more diverse patterns. To verify this point, we introduce a simple yet effective regularization, i.e., Dynamic Weighted Decorrelation Regularization (DWDR), to explicitly encourage networks to learn independent embedding channels. As the name implies, DWDR regresses the embedding correlation coefficient matrix to a sparse matrix, i.e., the identity matrix, with dynamic weights. The dynamic weights are applied to focus on channels that are still correlated during training. Besides, we propose a cross-view symmetric sampling strategy, which keeps the examples balanced between different platforms. Albeit simple, the proposed method has achieved competitive results on three large-scale benchmarks, i.e., University-1652, CVUSA and CVACT. Moreover, under harsh conditions, e.g., an extremely short feature of 64 dimensions, the proposed method surpasses the baseline model by a clear margin.
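The decorrelation idea described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the abstract does not give the exact form of the dynamic weights, so the weighting below (normalized absolute residuals of the correlation matrix) is a hypothetical choice, and the function names `correlation_matrix` and `dwdr_loss` are invented for this sketch.

```python
import numpy as np

def correlation_matrix(emb, eps=1e-8):
    """Pearson correlation between embedding channels; emb has shape (batch, dim)."""
    centered = emb - emb.mean(axis=0, keepdims=True)
    normed = centered / (centered.std(axis=0, keepdims=True) + eps)
    return normed.T @ normed / emb.shape[0]

def dwdr_loss(emb):
    """Weighted squared error between the correlation matrix and the identity.

    Assumption (not from the abstract): each weight is the normalized absolute
    residual, so channel pairs that are still correlated receive more weight.
    """
    corr = correlation_matrix(emb)
    residual = corr - np.eye(corr.shape[0])
    weights = np.abs(residual)
    weights = weights / (weights.sum() + 1e-8)
    return float((weights * residual ** 2).sum())

rng = np.random.default_rng(0)
# Eight roughly independent channels.
independent = rng.normal(size=(256, 8))
# Eight near-copies of one channel, i.e. a highly redundant embedding.
redundant = np.repeat(independent[:, :1], 8, axis=1) + 0.01 * rng.normal(size=(256, 8))

loss_independent = dwdr_loss(independent)
loss_redundant = dwdr_loss(redundant)
```

In this toy setup the redundant embedding should incur a much larger penalty than the independent one, which is the behavior a decorrelation regularizer is meant to have.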
CV · Sep 11, 2024
ThermalGaussian: Thermal 3D Gaussian Splatting
Rongfeng Lu, Hangyu Chen, Zunjie Zhu et al.
Thermography is especially valuable for the military and other users of surveillance cameras. Some recent methods based on Neural Radiance Fields (NeRF) have been proposed to reconstruct thermal scenes in 3D from a set of thermal and RGB images. However, 3D Gaussian splatting (3DGS) now prevails over NeRF due to its rapid training and real-time rendering. In this work, we propose ThermalGaussian, the first thermal 3DGS approach capable of rendering high-quality images in RGB and thermal modalities. We first calibrate the RGB camera and the thermal camera to ensure that both modalities are accurately aligned. Subsequently, we use the registered images to learn the multimodal 3D Gaussians. To prevent overfitting to any single modality, we introduce several multimodal regularization constraints. We also develop smoothing constraints tailored to the physical characteristics of the thermal modality. Besides, we contribute a real-world dataset named RGBT-Scenes, captured by a handheld thermal-infrared camera, facilitating future research on thermal scene reconstruction. We conduct comprehensive experiments to show that ThermalGaussian achieves photorealistic rendering of thermal images and improves the rendering quality of RGB images. With the proposed multimodal regularization constraints, we also reduce the model's storage cost by 90%. Our project page is at https://thermalgaussian.github.io/.
CV · Jan 4 · Code
ParkGaussian: Surround-view 3D Gaussian Splatting for Autonomous Parking
Xiaobao Wei, Zhangjie Ye, Yuxiang Gu et al.
Parking is a critical task for autonomous driving systems (ADS), with unique challenges in crowded parking slots and GPS-denied environments. However, existing works focus on 2D parking slot perception, mapping, and localization, while 3D reconstruction, which is crucial for capturing complex spatial geometry in parking scenarios, remains underexplored. Naively improving the visual quality of reconstructed parking scenes does not directly benefit autonomous parking, as the key entry point for parking is the slot perception module. To address these limitations, we curate the first benchmark named ParkRecon3D, specifically designed for parking scene reconstruction. It includes sensor data from four surround-view fisheye cameras with calibrated extrinsics and dense parking slot annotations. We then propose ParkGaussian, the first framework that integrates 3D Gaussian Splatting (3DGS) for parking scene reconstruction. To further improve the alignment between reconstruction and downstream parking slot detection, we introduce a slot-aware reconstruction strategy that leverages existing parking perception methods to enhance the synthesis quality of slot regions. Experiments on ParkRecon3D demonstrate that ParkGaussian achieves state-of-the-art reconstruction quality and better preserves perception consistency for downstream tasks. The code and dataset will be released at: https://github.com/wm-research/ParkGaussian
CV · May 17, 2023 · Code
Rethinking Boundary Discontinuity Problem for Oriented Object Detection
Hang Xu, Xinyuan Liu, Haonan Xu et al.
Oriented object detection has developed rapidly in the past few years, and rotation equivariance is crucial for detectors to predict rotated boxes. The prediction is expected to maintain the corresponding rotation when objects rotate, but a severe mutation in the angular prediction is sometimes observed when objects rotate near the boundary angle, which is the well-known boundary discontinuity problem. The problem has long been believed to be caused by the sharp loss increase at the angular boundary, and widely used joint-optim IoU-like methods deal with this problem by loss-smoothing. However, we experimentally find that even state-of-the-art IoU-like methods actually fail to solve the problem. On further analysis, we find that the key to the solution lies in the encoding mode of the smoothing function rather than in joint or independent optimization. In existing IoU-like methods, the model essentially attempts to fit the angular relationship between box and object, where the break point at the angular boundary makes the predictions highly unstable. To deal with this issue, we propose a dual-optimization paradigm for angles. We decouple reversibility and joint-optim from a single smoothing function into two distinct entities, which for the first time achieves the objectives of both correcting the angular boundary and blending the angle with other parameters. Extensive experiments on multiple datasets show that the boundary discontinuity problem is well-addressed. Moreover, typical IoU-like methods are improved to the same level without an obvious performance gap. The code is available at https://github.com/hangxu-cv/cvpr24acm.
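To make the boundary discontinuity concrete, the following sketch compares a naive angular difference with a distance computed on a periodic encoding of the angle. It is not the dual-optimization paradigm proposed in the paper; the periodic encoding is a generic, well-known alternative used here only to show why the raw angle breaks at the boundary. The 180-degree period (a rectangle looks the same after a half turn) and the function names are assumptions of this example.

```python
import numpy as np

def naive_angle_error(pred_deg, target_deg):
    """Plain absolute difference between two angles in degrees."""
    return abs(pred_deg - target_deg)

def periodic_angle_error(pred_deg, target_deg, period_deg=180.0):
    """Euclidean distance between angles mapped onto the unit circle.

    One full period of the box orientation is mapped to one full turn,
    so angles on either side of the boundary land next to each other.
    """
    def encode(angle_deg):
        phase = 2.0 * np.pi * angle_deg / period_deg
        return np.array([np.cos(phase), np.sin(phase)])
    return float(np.linalg.norm(encode(pred_deg) - encode(target_deg)))

# Two boxes at 89 and -89 degrees differ by only 2 degrees of physical rotation.
naive_at_boundary = naive_angle_error(89.0, -89.0)
periodic_at_boundary = periodic_angle_error(89.0, -89.0)
# The same 2-degree gap far from the boundary, for comparison.
periodic_away_from_boundary = periodic_angle_error(1.0, -1.0)
```

The naive difference reports a 178-degree error for a 2-degree physical rotation, while the periodic distance is identical at the boundary and away from it.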
CV · Mar 20, 2025
4D Gaussian Splatting SLAM
Yanyan Li, Youxu Fang, Zunjie Zhu et al.
Simultaneously localizing camera poses and constructing Gaussian radiance fields in dynamic scenes establishes a crucial bridge between 2D images and the 4D real world. Instead of removing dynamic objects as distractors and reconstructing only static environments, this paper proposes an efficient architecture that incrementally tracks camera poses and establishes 4D Gaussian radiance fields in unknown scenarios by using a sequence of RGB-D images. First, by generating motion masks, we obtain static and dynamic priors for each pixel. To eliminate the influence of static scenes and improve the efficiency of learning the motion of dynamic objects, we classify the Gaussian primitives into static and dynamic Gaussian sets, while sparse control points along with an MLP are utilized to model the transformation fields of the dynamic Gaussians. To more accurately learn the motion of dynamic Gaussians, a novel 2D optical flow map reconstruction algorithm is designed to render optical flows of dynamic objects between neighboring images, which are further used to supervise the 4D Gaussian radiance fields along with traditional photometric and geometric constraints. In experiments, qualitative and quantitative evaluation results show that the proposed method achieves robust tracking and high-quality view synthesis performance in real-world environments.
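One plausible way to turn per-pixel motion masks into static and dynamic sets of primitives is to project each primitive's center into the image and look up the mask. The sketch below assumes a pinhole camera and points already expressed in the camera frame; the abstract does not describe the actual classification rule, so the function `split_by_motion_mask` and its parameters are hypothetical.

```python
import numpy as np

def split_by_motion_mask(points_cam, motion_mask, fx, fy, cx, cy):
    """Split 3D points into static and dynamic index arrays.

    points_cam: (N, 3) points in the camera frame.
    motion_mask: (H, W) boolean image, True where the pixel is dynamic.
    Points behind the camera or outside the image are treated as static.
    """
    h, w = motion_mask.shape
    z = points_cam[:, 2]
    z_safe = np.where(z > 0, z, 1.0)  # avoid dividing by zero or negative depth
    u = np.round(fx * points_cam[:, 0] / z_safe + cx).astype(int)
    v = np.round(fy * points_cam[:, 1] / z_safe + cy).astype(int)
    in_view = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    dynamic = np.zeros(len(points_cam), dtype=bool)
    dynamic[in_view] = motion_mask[v[in_view], u[in_view]]
    return np.flatnonzero(~dynamic), np.flatnonzero(dynamic)

mask = np.zeros((4, 4), dtype=bool)
mask[1, 2] = True  # a single dynamic pixel at row 1, column 2
points = np.array([
    [2.0, 1.0, 1.0],    # projects to column 2, row 1: dynamic
    [0.0, 0.0, 1.0],    # projects to column 0, row 0: static
    [10.0, 10.0, 1.0],  # projects outside the image: static
])
static_idx, dynamic_idx = split_by_motion_mask(points, mask, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```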
CV · Jul 24, 2025
DepthDark: Robust Monocular Depth Estimation for Low-Light Environments
Longjian Zeng, Zunjie Zhu, Rongfeng Lu et al.
In recent years, foundation models for monocular depth estimation have received increasing attention. Current methods mainly address typical daylight conditions, but their effectiveness notably decreases in low-light environments. There is a lack of robust foundation models for monocular depth estimation specifically designed for low-light scenarios. This largely stems from the absence of large-scale, high-quality paired depth datasets for low-light conditions and of an effective parameter-efficient fine-tuning (PEFT) strategy. To address these challenges, we propose DepthDark, a robust foundation model for low-light monocular depth estimation. We first introduce a flare-simulation module and a noise-simulation module to accurately simulate the imaging process under nighttime conditions, producing high-quality paired depth datasets for low-light conditions. Additionally, we present an effective low-light PEFT strategy that utilizes illumination guidance and multiscale feature fusion to enhance the model's capability in low-light environments. Our method achieves state-of-the-art depth estimation performance on the challenging nuScenes-Night and RobotCar-Night datasets, validating its effectiveness using limited training data and computing resources.
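As a rough illustration of what a noise-simulation step for nighttime imaging might look like, the sketch below darkens a daytime image and adds signal-dependent shot noise plus Gaussian read noise. This is a standard Poisson-Gaussian sensor model chosen for the example, not the module described in the paper, and flare is not modeled at all. The function name and all parameter values are assumptions.

```python
import numpy as np

def simulate_low_light(image, exposure_scale=0.1, photons_at_white=200.0,
                       read_noise_std=0.01, seed=0):
    """Darken an image in [0, 1] and add Poisson shot noise and Gaussian read noise."""
    rng = np.random.default_rng(seed)
    dark = np.clip(image, 0.0, 1.0) * exposure_scale
    # Shot noise: photon counts follow a Poisson law whose mean is the signal.
    shot = rng.poisson(dark * photons_at_white) / photons_at_white
    # Read noise: signal-independent Gaussian noise from the sensor electronics.
    noisy = shot + rng.normal(0.0, read_noise_std, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

day_image = np.full((32, 32), 0.8)
night_image = simulate_low_light(day_image)
```

Because the depth map of the original image is unchanged by this degradation, pairing `night_image` with the daytime depth yields a synthetic low-light training pair, which is the general motivation stated in the abstract.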
CV · Oct 19, 2025
2DGS-R: Revisiting the Normal Consistency Regularization in 2D Gaussian Splatting
Haofan Ren, Qingsong Yan, Ming Lu et al.
Recent advancements in 3D Gaussian Splatting (3DGS) have greatly influenced neural fields, as 3DGS enables high-fidelity rendering with impressive visual quality. However, 3DGS has difficulty accurately representing surfaces. In contrast, 2DGS transforms the 3D volume into a collection of 2D planar Gaussian disks. Despite advancements in geometric fidelity, rendering quality remains compromised, highlighting the challenge of achieving both high-quality rendering and precise geometric structures. This indicates that optimizing both geometric and rendering quality in a single training stage is currently infeasible. To overcome this limitation, we present 2DGS-R, a new method that uses a hierarchical training approach to improve rendering quality while maintaining geometric accuracy. 2DGS-R first trains the original 2D Gaussians with the normal consistency regularization. Then 2DGS-R selects the 2D Gaussians with inadequate rendering quality and applies a novel in-place cloning operation to enhance the 2D Gaussians. Finally, we fine-tune the 2DGS-R model with opacity frozen. Experimental results show that compared to the original 2DGS, our method requires only 1% more storage and minimal additional training time. Despite this negligible overhead, it achieves high-quality rendering results while preserving fine geometric structures. These findings indicate that our approach effectively balances efficiency with performance, leading to improvements in both visual fidelity and geometric reconstruction accuracy.
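The select, clone, and fine-tune stages can be pictured with a small NumPy sketch. The abstract does not say how rendering quality is scored per Gaussian or what the in-place cloning does internally, so everything here is hypothetical: a per-Gaussian error score with a fixed threshold, clones that are exact copies at the same position, and a plain gradient step that skips frozen parameters.

```python
import numpy as np

def clone_low_quality(params, per_gaussian_error, threshold):
    """Append exact copies of every Gaussian whose error exceeds the threshold."""
    selected = per_gaussian_error > threshold
    cloned = {name: np.concatenate([values, values[selected]], axis=0)
              for name, values in params.items()}
    return cloned, selected

def finetune_step(params, grads, lr=0.01, frozen=("opacity",)):
    """One gradient-descent step that leaves the frozen parameters untouched."""
    return {name: values if name in frozen else values - lr * grads[name]
            for name, values in params.items()}

params = {
    "position": np.arange(12, dtype=float).reshape(4, 3),
    "opacity": np.array([0.9, 0.5, 0.8, 0.4]),
}
errors = np.array([0.1, 0.9, 0.2, 0.8])  # hypothetical per-Gaussian error scores
cloned, selected = clone_low_quality(params, errors, threshold=0.5)

# Placeholder gradients of all ones, only to exercise the update rule.
grads = {name: np.ones_like(values) for name, values in cloned.items()}
tuned = finetune_step(cloned, grads)
```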
CV · Jul 9, 2025
Capturing Stable HDR Videos Using a Dual-Camera System
Qianyu Zhang, Bolun Zheng, Lingyu Zhu et al.
High Dynamic Range (HDR) video acquisition using the alternating exposure (AE) paradigm has garnered significant attention due to its cost-effectiveness with a single consumer camera. However, despite progress driven by deep neural networks, these methods remain prone to temporal flicker in real-world applications due to inter-frame exposure inconsistencies. To address this challenge while maintaining the cost-effectiveness of the AE paradigm, we propose a novel learning-based HDR video generation solution. Specifically, we propose a dual-stream HDR video generation paradigm that decouples temporal luminance anchoring from exposure-variant detail reconstruction, overcoming the inherent limitations of the AE paradigm. To support this, we design an asynchronous dual-camera system (DCS), which enables independent exposure control across two cameras, eliminating the need for synchronization typically required in traditional multi-camera setups. Furthermore, an exposure-adaptive fusion network (EAFNet) is formulated for the DCS system. EAFNet integrates a pre-alignment subnetwork that aligns features across varying exposures, ensuring robust feature extraction for subsequent fusion, an asymmetric cross-feature fusion subnetwork that emphasizes reference-based attention to effectively merge these features across exposures, and a reconstruction subnetwork to mitigate ghosting artifacts and preserve fine details. Extensive experimental evaluations demonstrate that the proposed method achieves state-of-the-art performance across various datasets, showing the remarkable potential of our solution in HDR video reconstruction. The codes and data captured by DCS will be available at https://zqqqyu.github.io/DCS-HDR/.
CV · May 26, 2025
K-Buffers: A Plug-in Method for Enhancing Neural Fields with Multiple Buffers
Haofan Ren, Zunjie Zhu, Xiang Chen et al.
Neural fields are now the central focus of research in 3D vision and computer graphics. Existing methods mainly focus on various scene representations, such as neural points and 3D Gaussians. However, few works have studied the rendering process to enhance the neural fields. In this work, we propose a plug-in method named K-Buffers that leverages multiple buffers to improve the rendering performance. Our method first renders K buffers from scene representations and constructs K pixel-wise feature maps. Then, we introduce a K-Feature Fusion Network (KFN) to merge the K pixel-wise feature maps. Finally, we adopt a feature decoder to generate the rendered image. We also introduce an acceleration strategy to improve rendering speed and quality. We apply our method to well-known radiance field baselines, including neural point fields and 3D Gaussian Splatting (3DGS). Extensive experiments demonstrate that our method effectively enhances the rendering performance of neural point fields and 3DGS.
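The pipeline of rendering K buffers and merging K per-pixel feature maps can be mimicked in NumPy. In this sketch each pixel keeps the features of its K nearest fragments, and a fixed softmax over negative depth stands in for the learned K-Feature Fusion Network; that stand-in, the tensor layout, and the function names are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def k_nearest_buffers(depths, features, k):
    """Keep the k nearest fragments per pixel.

    depths: (H, W, N) fragment depths; features: (H, W, N, C) fragment features.
    Returns (H, W, k, C) features and (H, W, k) depths, nearest first.
    """
    order = np.argsort(depths, axis=-1)[..., :k]
    buffer_features = np.take_along_axis(features, order[..., None], axis=2)
    buffer_depths = np.take_along_axis(depths, order, axis=-1)
    return buffer_features, buffer_depths

def fuse_buffers(buffer_features, buffer_depths, temperature=1.0):
    """Blend the k feature maps with softmax weights that favor nearer fragments."""
    logits = -buffer_depths / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights[..., None] * buffer_features).sum(axis=2)

rng = np.random.default_rng(0)
depths = rng.uniform(0.5, 5.0, size=(2, 2, 4))   # 4 fragments per pixel
features = rng.normal(size=(2, 2, 4, 3))         # 3 feature channels per fragment
buffer_features, buffer_depths = k_nearest_buffers(depths, features, k=2)
fused = fuse_buffers(buffer_features, buffer_depths)
```

A learned fusion network would replace `fuse_buffers`, and a decoder would then map the fused features to pixel colors.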
CV · May 8, 2025
Progressive Inertial Poser: Progressive Real-Time Kinematic Chain Estimation for 3D Full-Body Pose from Three IMU Sensors
Zunjie Zhu, Yan Zhao, Yihan Hu et al.
The motion capture system that supports full-body virtual representation is of key significance for virtual reality. Compared to vision-based systems, full-body pose estimation from sparse tracking signals is not limited by environmental conditions or recording range. However, previous works either face the challenge of wearing additional sensors on the pelvis and lower-body or rely on external visual sensors to obtain global positions of key joints. To improve the practicality of the technology for virtual reality applications, we estimate full-body poses using only inertial data obtained from three Inertial Measurement Unit (IMU) sensors worn on the head and wrists, thereby reducing the complexity of the hardware system. In this work, we propose a method called Progressive Inertial Poser (ProgIP) for human pose estimation, which combines neural network estimation with a human dynamics model, considers the hierarchical structure of the kinematic chain, and employs a multi-stage progressive network estimation with increased depth to reconstruct full-body motion in real time. The encoder combines Transformer Encoder and bidirectional LSTM (TE-biLSTM) to flexibly capture the temporal dependencies of the inertial sequence, while the decoder based on multi-layer perceptrons (MLPs) transforms high-dimensional features and accurately projects them onto Skinned Multi-Person Linear (SMPL) model parameters. Quantitative and qualitative experimental results on multiple public datasets show that our method outperforms state-of-the-art methods with the same inputs, and is comparable to recent works using six IMU sensors.
CV · Dec 7, 2018
Real-time Indoor Scene Reconstruction with RGBD and Inertia Input
Zunjie Zhu, Feng Xu
Camera motion estimation is a key technique for 3D scene reconstruction and simultaneous localization and mapping (SLAM). To make it feasible, previous works usually assume slow camera motions, which limits its usage in many real cases. We propose an end-to-end 3D reconstruction system which combines color, depth and inertial measurements to achieve robust reconstruction with fast sensor motions. Our framework extends the Kalman filter to fuse the three kinds of information and involves an iterative method to jointly optimize feature correspondences, camera poses and scene geometry. We also propose a novel geometry-aware patch deformation technique to adapt the feature appearance in the image domain, leading to more accurate feature matching under fast camera motions. Experiments show that our patch deformation method improves the accuracy of feature tracking, and our 3D reconstruction outperforms the state-of-the-art solutions under fast camera motions.
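For readers unfamiliar with the filtering step, here is the textbook scalar Kalman filter, with the prediction driven by an inertial displacement and the update driven by a visual position measurement. It is a deliberately reduced illustration: the paper's extended filter fuses color, depth and inertial data over a much richer state, and the variable names and noise values below are assumptions of this sketch.

```python
def kalman_predict(x, p, motion, process_var):
    """Propagate the state estimate x and its variance p using an inertial displacement."""
    return x + motion, p + process_var

def kalman_update(x, p, z, meas_var):
    """Correct the prediction with a measurement z that has variance meas_var."""
    gain = p / (p + meas_var)
    return x + gain * (z - x), (1.0 - gain) * p

# Start at position 0 with variance 1.
x, p = 0.0, 1.0
# The IMU suggests the camera moved by 1.0, with some added uncertainty.
x, p = kalman_predict(x, p, motion=1.0, process_var=0.5)
# A visual measurement places the camera at 1.2.
x, p = kalman_update(x, p, z=1.2, meas_var=0.5)
```

After the update the estimate lies between the inertial prediction and the visual measurement, and its variance is lower than either source alone would give.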