CV | Oct 5, 2022 | Code
TartanCalib: Iterative Wide-Angle Lens Calibration using Adaptive SubPixel Refinement of AprilTags
Bardienus P Duisterhof, Yaoyu Hu, Si Heng Teng et al.
Wide-angle cameras are uniquely positioned for mobile robots, by virtue of the rich information they provide in a small, light, and cost-effective form factor. An accurate calibration of the intrinsics and extrinsics is a critical prerequisite for using the edge of a wide-angle lens for depth perception and odometry. Calibrating wide-angle lenses with current state-of-the-art techniques yields poor results due to extreme distortion at the edge, as most algorithms assume a lens with low to medium distortion closer to a pinhole projection. In this work we present our methodology for accurate wide-angle calibration. Our pipeline generates an intermediate model, and leverages it to iteratively improve feature detection and eventually the camera parameters. We test three key methods to utilize intermediate camera models: (1) undistorting the image into virtual pinhole cameras, (2) reprojecting the target into the image frame, and (3) adaptive subpixel refinement. Combining adaptive subpixel refinement and feature reprojection reduces reprojection errors by up to 26.59%, helps us detect up to 42.01% more features, and improves performance in the downstream task of dense depth mapping. Finally, TartanCalib is open-source and implemented in an easy-to-use calibration toolbox. We also provide a translation layer with other state-of-the-art works, which allows for regressing generic models with thousands of parameters or using a more robust solver. To this end, TartanCalib is the tool of choice for wide-angle calibration. Project website and code: http://tartancalib.com.
RO | Sep 18, 2024 | Code
Bundle Adjustment in the Eager Mode
Zitong Zhan, Huan Xu, Zihang Fang et al.
Bundle adjustment (BA) is a critical technique in various robotic applications such as simultaneous localization and mapping (SLAM), augmented reality (AR), and photogrammetry. BA optimizes parameters such as camera poses and 3D landmarks to align them with observations. With the growing importance of deep learning in perception systems, there is an increasing need to integrate BA with deep learning frameworks for enhanced reliability and performance. However, widely-used C++-based BA libraries, such as GTSAM, g2o, and Ceres, lack native integration with modern deep learning libraries like PyTorch. This limitation affects their flexibility, adaptability, ease of debugging, and overall implementation efficiency. To address this gap, we introduce an eager-mode BA library seamlessly integrated with PyTorch with high efficiency. Our approach includes GPU-accelerated, differentiable, and sparse operations designed for second-order optimization, Lie group and Lie algebra operations, and linear solvers. Our eager-mode BA on GPU demonstrates substantial runtime efficiency, achieving an average speedup of 18.5×, 22×, and 23× compared to GTSAM, g2o, and Ceres, respectively. The source code will be available at https://github.com/sair-lab/bae.
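As a rough illustration of what BA minimizes, the sketch below evaluates a sum of squared reprojection errors for a plain pinhole camera. It is not the library's API: the function names, the pinhole model, and the `(camera index, landmark index, pixel)` observation layout are all assumptions made for this example, and no optimizer is included.

```python
def transform(R, t, p):
    """Map a world point p into the camera frame: R @ p + t (R is 3x3 nested lists)."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (assumes z > 0)."""
    x, y, z = point_cam
    return (fx * x / z + cx, fy * y / z + cy)


def reprojection_cost(poses, landmarks, observations, intrinsics):
    """Sum of squared pixel residuals over all (camera, landmark, pixel) observations."""
    cost = 0.0
    for cam_idx, lm_idx, (u_obs, v_obs) in observations:
        R, t = poses[cam_idx]
        u, v = project(transform(R, t, landmarks[lm_idx]), *intrinsics)
        cost += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return cost


# Hypothetical toy problem: one camera at the origin, one landmark 2 m ahead.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
poses = [(identity, (0.0, 0.0, 0.0))]
landmarks = [(0.0, 0.0, 2.0)]
intrinsics = (100.0, 100.0, 50.0, 50.0)  # fx, fy, cx, cy
print(reprojection_cost(poses, landmarks, [(0, 0, (53.0, 54.0))], intrinsics))
```

A BA solver adjusts `poses` and `landmarks` to drive this cost down, typically with a second-order method such as Gauss-Newton or Levenberg-Marquardt over a sparse Jacobian.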
CV | Sep 12, 2024
FIReStereo: Forest InfraRed Stereo Dataset for UAS Depth Perception in Visually Degraded Environments
Devansh Dhrafani, Yifei Liu, Andrew Jong et al.
Robust depth perception in visually-degraded environments is crucial for autonomous aerial systems. Thermal imaging cameras, which capture infrared radiation, are robust to visual degradation. However, due to the lack of a large-scale dataset, the use of thermal cameras for unmanned aerial system (UAS) depth perception has remained largely unexplored. This paper presents a stereo thermal depth perception dataset for autonomous aerial perception applications. The dataset consists of stereo thermal images, LiDAR, IMU, and ground truth depth maps captured in urban and forest settings under diverse conditions such as day, night, rain, and smoke. We benchmark representative stereo depth estimation algorithms, offering insights into their performance in degraded conditions. Models trained on our dataset generalize well to unseen smoky conditions, highlighting the robustness of stereo thermal imaging for depth perception. We aim for this work to enhance robotic perception in disaster scenarios, allowing for exploration and operations in previously unreachable areas. The dataset and source code are available at https://firestereo.github.io.
CV | Oct 31, 2020 | Code
TartanVO: A Generalizable Learning-based VO
Wenshan Wang, Yaoyu Hu, Sebastian Scherer
We present the first learning-based visual odometry (VO) model that generalizes to multiple datasets and real-world scenarios and outperforms geometry-based methods in challenging scenes. We achieve this by leveraging the SLAM dataset TartanAir, which provides a large amount of diverse synthetic data in challenging environments. Furthermore, to make our VO model generalize across datasets, we propose an up-to-scale loss function and incorporate the camera intrinsic parameters into the model. Experiments show that a single model, TartanVO, trained only on synthetic data, without any finetuning, generalizes to real-world datasets such as KITTI and EuRoC, demonstrating significant advantages over the geometry-based methods on challenging trajectories. Our code is available at https://github.com/castacks/tartanvo.
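The abstract does not spell out the up-to-scale loss. One plausible form, sketched here under that assumption, normalizes both the predicted and the ground-truth translation to unit length before comparing them, so that only the direction of motion is penalized. The function names and the `eps` guard are illustrative choices, not taken from the paper's code.

```python
import math


def normalize(v, eps=1e-6):
    """Scale v to unit length; eps guards against division by a near-zero norm."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / max(norm, eps) for c in v]


def up_to_scale_translation_loss(t_pred, t_true, eps=1e-6):
    """Euclidean distance between the two unit-normalized translation vectors."""
    a = normalize(t_pred, eps)
    b = normalize(t_true, eps)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Same direction, different magnitude: the loss ignores the scale difference.
print(up_to_scale_translation_loss([2.0, 0.0, 0.0], [5.0, 0.0, 0.0]))
# Orthogonal directions are penalized regardless of magnitude.
print(up_to_scale_translation_loss([1.0, 0.0, 0.0], [0.0, 3.0, 0.0]))
```

Monocular VO cannot observe metric scale, which is why a loss of this kind compares directions only; a rotation term would be added separately.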
CV | Jun 10, 2025
UFM: A Simple Path towards Unified Dense Correspondence with Flow
Yuchen Zhang, Nikhil Keetha, Chenwei Lyu et al.
Dense image correspondence is central to many applications, such as visual odometry, 3D reconstruction, object association, and re-identification. Historically, dense correspondence has been tackled separately for wide-baseline scenarios and optical flow estimation, despite the common goal of matching content between two images. In this paper, we develop a Unified Flow & Matching model (UFM), which is trained on unified data for pixels that are co-visible in both source and target images. UFM uses a simple, generic transformer architecture that directly regresses the (u,v) flow. It is easier to train and more accurate for large flows compared to the typical coarse-to-fine cost volumes in prior work. UFM is 28% more accurate than state-of-the-art flow methods (Unimatch), while also having 62% less error and being 6.7x faster than dense wide-baseline matchers (RoMa). UFM is the first to demonstrate that unified training can outperform specialized approaches across both domains. This result enables fast, general-purpose correspondence and opens new directions for multi-modal, long-range, and real-time correspondence tasks.
RO | May 6, 2025
Demonstrating ViSafe: Vision-enabled Safety for High-speed Detect and Avoid
Parv Kapoor, Ian Higgins, Nikhil Keetha et al.
Assured safe-separation is essential for achieving seamless high-density operation of airborne vehicles in a shared airspace. To equip resource-constrained aerial systems with this safety-critical capability, we present ViSafe, a high-speed vision-only airborne collision avoidance system. ViSafe offers a full-stack solution to the Detect and Avoid (DAA) problem by tightly integrating a learning-based edge-AI framework with a custom multi-camera hardware prototype designed under SWaP-C constraints. By leveraging perceptual input-focused control barrier functions (CBF) to design, encode, and enforce safety thresholds, ViSafe can provide provably safe runtime guarantees for self-separation in high-speed aerial operations. We evaluate ViSafe's performance through an extensive test campaign involving both simulated digital twins and real-world flight scenarios. By independently varying agent types, closure rates, interaction geometries, and environmental conditions (e.g., weather and lighting), we demonstrate that ViSafe consistently ensures self-separation across diverse scenarios. In first-of-its-kind real-world high-speed collision avoidance tests with closure rates reaching 144 km/h, ViSafe sets a new benchmark for vision-only autonomous collision avoidance, establishing a new standard for safety in high-speed aerial navigation.
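To make the control barrier function idea concrete, here is a deliberately minimal one-dimensional sketch. It is not ViSafe's formulation: the single-integrator model (the commanded value is the rate of change of separation), the barrier h = d - d_safe, and the names `alpha` and `cbf_filter` are assumptions chosen for illustration. The CBF condition dh/dt >= -alpha * h then reduces to a lower bound on the commanded rate.

```python
def cbf_filter(rate_nominal, distance, d_safe, alpha):
    """Return the separation rate closest to the nominal one that satisfies the CBF condition.

    Model (assumed): d_dot = rate, barrier h = distance - d_safe.
    Condition: h_dot >= -alpha * h, i.e. rate >= -alpha * (distance - d_safe).
    """
    h = distance - d_safe
    lower_bound = -alpha * h
    return max(rate_nominal, lower_bound)


# Closing at 10 m/s with 10 m of margin and alpha = 0.5: closure is capped at 5 m/s.
print(cbf_filter(rate_nominal=-10.0, distance=30.0, d_safe=20.0, alpha=0.5))
# Already opening the gap: the nominal command passes through unchanged.
print(cbf_filter(rate_nominal=2.0, distance=30.0, d_safe=20.0, alpha=0.5))
```

As the margin h shrinks toward zero, the permitted closing rate shrinks with it, which is what keeps the separation from dropping below `d_safe` under this simplified model.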
CV | May 8, 2024
Geometry-Informed Distance Candidate Selection for Adaptive Lightweight Omnidirectional Stereo Vision with Fisheye Images
Conner Pulling, Je Hon Tan, Yaoyu Hu et al.
Multi-view stereo omnidirectional distance estimation usually needs to build a cost volume with many hypothetical distance candidates. Building the cost volume is often computationally heavy given the limited resources of a mobile robot. We propose a new geometry-informed distance candidate selection method that enables the use of a very small number of candidates and reduces the computational cost. We demonstrate the use of the geometry-informed candidates in a set of model variants. We find that by adjusting the candidates during robot deployment, our geometry-informed distance candidates also improve a pre-trained model's accuracy if the extrinsics or the number of cameras changes. Without any re-training or fine-tuning, our models outperform models trained with evenly distributed distance candidates. Models are also released as hardware-accelerated versions with a new dedicated large-scale dataset. The project page, code, and dataset can be found at https://theairlab.org/gicandidates/.
CV | May 26, 2023
VoxDet: Voxel Learning for Novel Instance Detection
Bowen Li, Jiashun Wang, Yaoyu Hu et al.
Detecting unseen instances based on multi-view templates is a challenging problem due to its open-world nature. Traditional methodologies, which primarily rely on 2D representations and matching techniques, are often inadequate in handling pose variations and occlusions. To solve this, we introduce VoxDet, a pioneering 3D geometry-aware framework that fully utilizes the strong 3D voxel representation and a reliable voxel matching mechanism. VoxDet first introduces a template voxel aggregation (TVA) module, which transforms multi-view 2D images into 3D voxel features. By leveraging associated camera poses, these features are aggregated into a compact 3D template voxel. In novel instance detection, this voxel representation demonstrates heightened resilience to occlusion and pose variations. We also discover that a 3D reconstruction objective helps to pre-train the 2D-3D mapping in TVA. Second, to quickly align with the template voxel, VoxDet incorporates a Query Voxel Matching (QVM) module. The 2D queries are first converted into their voxel representation with the learned 2D-3D mapping. We find that since the 3D voxel representations encode the geometry, we can first estimate the relative rotation and then compare the aligned voxels, leading to improved accuracy and efficiency. In addition to the method, we also introduce the first instance detection benchmark, RoboTools, where 20 unique instances are video-recorded with camera extrinsics. RoboTools also provides 24 challenging cluttered scenarios with more than 9k box annotations. Exhaustive experiments are conducted on the demanding LineMod-Occlusion, YCB-video, and RoboTools benchmarks, where VoxDet outperforms various 2D baselines remarkably with faster speed. To the best of our knowledge, VoxDet is the first to incorporate implicit 3D knowledge for 2D novel instance detection tasks.
RO | Sep 28, 2021
Targetless Extrinsic Calibration of Stereo Cameras, Thermal Cameras, and Laser Sensors in the Wild
Taimeng Fu, Huai Yu, Wen Yang et al.
The fusion of multi-modal sensors has become increasingly popular in autonomous driving and intelligent robots since it can provide richer information than any single sensor and enhance reliability in complex environments. Multi-sensor extrinsic calibration is one of the key factors of sensor fusion. However, such calibration is difficult due to the variety of sensor modalities and the requirement of calibration targets and human labor. In this paper, we demonstrate a new targetless cross-modal calibration framework by focusing on the extrinsic transformations among stereo cameras, thermal cameras, and laser sensors. Specifically, the calibration between stereo and laser is conducted in 3D space by minimizing the registration error, while the thermal extrinsics relative to the other two sensors are estimated by optimizing the alignment of the edge features. Our method requires no dedicated targets and performs the multi-sensor calibration in a single shot without human interaction. Experimental results show that the calibration framework is accurate and applicable in general scenes.
RO | Aug 20, 2021
Unified Representation of Geometric Primitives for Graph-SLAM Optimization Using Decomposed Quadrics
Weikun Zhen, Huai Yu, Yaoyu Hu et al.
In Simultaneous Localization And Mapping (SLAM) problems, high-level landmarks have the potential to build compact and informative maps compared to traditional point-based landmarks. In this work, we focus on the parameterization of frequently used geometric primitives including points, lines, planes, ellipsoids, cylinders, and cones. We first present a unified representation based on quadrics, leading to a consistent and concise formulation. Then we further study a decomposed model of quadrics that discloses the symmetric and degenerate properties of a primitive. Based on the decomposition, we develop geometrically meaningful quadric factors in the setting of a graph-SLAM problem. Then in simulation experiments, it is shown that the decomposed formulation has better efficiency and robustness to observation noise than baseline parameterizations. Finally, in real-world experiments, the proposed back-end framework is demonstrated to be capable of building compact and regularized maps.
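For readers unfamiliar with quadrics as a common container for primitives, the sketch below uses the textbook point-quadric form: a homogeneous point x = (x, y, z, 1) lies on the surface when x^T Q x = 0 for a symmetric 4x4 matrix Q. The helper names are hypothetical, and the paper's own parameterization and decomposition are not reproduced here.

```python
def quadric_value(Q, point):
    """Evaluate x^T Q x for the homogeneous point (x, y, z, 1); zero means 'on the surface'."""
    x = (point[0], point[1], point[2], 1.0)
    return sum(x[i] * Q[i][j] * x[j] for i in range(4) for j in range(4))


def sphere_quadric(center, radius):
    """Full-rank quadric: |p - c|^2 - r^2 = 0."""
    cx, cy, cz = center
    return [
        [1.0, 0.0, 0.0, -cx],
        [0.0, 1.0, 0.0, -cy],
        [0.0, 0.0, 1.0, -cz],
        [-cx, -cy, -cz, cx * cx + cy * cy + cz * cz - radius * radius],
    ]


def plane_quadric(a, b, c, d):
    """Rank-1 (degenerate) quadric pi * pi^T for the plane a*x + b*y + c*z + d = 0."""
    pi = (a, b, c, d)
    return [[pi[i] * pi[j] for j in range(4)] for i in range(4)]


def z_cylinder_quadric(radius):
    """Degenerate quadric for a cylinder around the z-axis: x^2 + y^2 - r^2 = 0."""
    return [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, -radius * radius],
    ]


print(quadric_value(sphere_quadric((1.0, 2.0, 3.0), 2.0), (3.0, 2.0, 3.0)))
print(quadric_value(plane_quadric(0.0, 0.0, 1.0, -1.0), (5.0, -4.0, 1.0)))
print(quadric_value(z_cylinder_quadric(2.0), (2.0, 0.0, 7.0)))
```

The zero row and column in the cylinder matrix, and the rank-1 structure of the plane, are simple instances of the degeneracy that a decomposed model can expose explicitly.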
CV | Mar 13, 2021
ORStereo: Occlusion-Aware Recurrent Stereo Matching for 4K-Resolution Images
Yaoyu Hu, Wenshan Wang, Huai Yu et al.
Stereo reconstruction models trained on small images do not generalize well to high-resolution data. Training a model on high-resolution images is hampered by limited data availability and is often infeasible due to limited computing resources. In this work, we present the Occlusion-aware Recurrent binocular Stereo matching (ORStereo), which deals with these issues by only training on available low disparity range stereo images. ORStereo generalizes to unseen high-resolution images with large disparity ranges by formulating the task as residual updates and refinements of an initial prediction. ORStereo is trained on images with disparity ranges limited to 256 pixels, yet it can operate on 4K-resolution input with over 1000 disparities using limited GPU memory. We test the model's capability on both synthetic and real-world high-resolution images. Experimental results demonstrate that ORStereo achieves performance on 4K-resolution images comparable to state-of-the-art methods trained on large disparity ranges. Compared to other methods that are only trained on low-resolution images, our method is 70% more accurate on 4K-resolution images.
RO | Mar 31, 2020
TartanAir: A Dataset to Push the Limits of Visual SLAM
Wenshan Wang, Delong Zhu, Xiangwei Wang et al.
We present a challenging dataset, TartanAir, for robot navigation tasks and more. The data is collected in photo-realistic simulation environments with the presence of moving objects, changing light, and various weather conditions. By collecting data in simulations, we are able to obtain multi-modal sensor data and precise ground truth labels such as the stereo RGB image, depth image, segmentation, optical flow, camera poses, and LiDAR point cloud. We set up large numbers of environments with various styles and scenes, covering challenging viewpoints and diverse motion patterns that are difficult to achieve by using physical data collection platforms. In order to enable data collection at such a large scale, we develop an automatic pipeline, including mapping, trajectory sampling, data processing, and data verification. We evaluate the impact of various factors on visual SLAM algorithms using our data. The results of state-of-the-art algorithms reveal that the visual SLAM problem is far from solved. Methods that show good performance on established datasets such as KITTI do not perform well in more difficult scenarios. Although we use simulation, our goal is to push the limits of Visual SLAM algorithms in the real world by providing a challenging benchmark for testing new methods, while also providing large, diverse training data for learning-based methods. Our dataset is available at http://theairlab.org/tartanair-dataset.
CV | Nov 23, 2019
Deep-Learning Assisted High-Resolution Binocular Stereo Depth Reconstruction
Yaoyu Hu, Weikun Zhen, Sebastian Scherer
This work presents dense stereo reconstruction using high-resolution images for infrastructure inspections. State-of-the-art stereo reconstruction methods, both learning and non-learning ones, consume too many computational resources on high-resolution data. Recent learning-based methods achieve top ranks on most benchmarks. However, they generalize poorly due to the lack of task-specific training data. We propose to use a less resource-demanding non-learning method, guided by a learning-based model, to handle high-resolution images and achieve accurate stereo reconstruction. The deep-learning model produces an initial disparity prediction with uncertainty for each pixel of the down-sampled stereo image pair. The uncertainty serves as a self-measurement of its generalization ability and determines the per-pixel searching range around the initially predicted disparity. The downstream process performs a modified version of the Semi-Global Block Matching method with the up-sampled per-pixel searching range. The proposed deep-learning assisted method is evaluated on the Middlebury dataset and high-resolution stereo images collected by our customized binocular stereo camera. The combination of learning and non-learning methods achieves better performance on 12 out of 15 cases of the Middlebury dataset. In our infrastructure inspection experiments, the average 3D reconstruction error is less than 0.004 m.
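The per-pixel search range described above can be pictured with a small sketch. The abstract does not give the exact rule, so everything here is an assumption: the window is centered on the up-scaled disparity, its half-width is `k` times the up-scaled uncertainty, and a minimum half-width of one pixel keeps confident pixels searchable.

```python
def search_range(d_low, sigma_low, scale, k=3.0, d_max=None):
    """Map a low-resolution disparity and its uncertainty to a full-resolution search interval.

    d_low, sigma_low: predicted disparity and uncertainty in low-resolution pixels.
    scale: up-sampling factor (disparity grows linearly with image width).
    k: hypothetical multiplier on the uncertainty; d_max optionally caps the interval.
    """
    d_full = d_low * scale
    half_width = max(1.0, k * sigma_low * scale)
    lo = max(0.0, d_full - half_width)
    hi = d_full + half_width
    if d_max is not None:
        hi = min(hi, d_max)
    return lo, hi


# Uncertain pixel: disparity 10 +/- 0.5 at quarter resolution gives a 12-pixel-wide window.
print(search_range(10.0, 0.5, scale=4))
# Confident pixel: the window collapses to the minimum half-width.
print(search_range(1.0, 0.0, scale=4))
```

A block matcher that only evaluates disparities inside `[lo, hi]` for each pixel does far less work than one scanning the full disparity range, which is the resource saving the abstract refers to.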
RO | Nov 8, 2019
LiDAR Enhanced Structure-from-Motion
Weikun Zhen, Yaoyu Hu, Huai Yu et al.
Although Structure-from-Motion (SfM) as a maturing technique has been widely used in many applications, state-of-the-art SfM algorithms are still not robust enough in certain situations. For example, images for inspection purposes are often taken at close range to obtain detailed textures, which results in less overlap between images and thus decreases the accuracy of the estimated motion. In this paper, we propose a LiDAR-enhanced SfM pipeline that jointly processes data from a rotating LiDAR and a stereo camera pair to estimate sensor motions. We show that incorporating LiDAR helps to effectively reject falsely matched images and significantly improves model consistency in large-scale environments. Experiments are conducted in different environments to test the performance of the proposed pipeline, and comparisons with state-of-the-art SfM algorithms are reported.
RO | Jul 1, 2019
A Joint Optimization Approach of LiDAR-Camera Fusion for Accurate Dense 3D Reconstructions
Weikun Zhen, Yaoyu Hu, Jingfeng Liu et al.
Fusing data from LiDAR and camera is conceptually attractive because of their complementary properties. For instance, camera images have higher resolution and carry color, while LiDAR data provide more accurate range measurements and have a wider Field Of View (FOV). However, the sensor fusion problem remains challenging since it is difficult to find reliable correlations between data of very different characteristics (geometry vs. texture, sparse vs. dense). This paper proposes an offline LiDAR-camera fusion method to build dense, accurate 3D models. Specifically, our method jointly solves a bundle adjustment (BA) problem and a cloud registration problem to compute camera poses and the sensor extrinsic calibration. In experiments, we show that our method can achieve an average accuracy of 2.7 mm and a resolution of 70 points per square cm when compared against ground truth data from a survey scanner. Furthermore, the extrinsic calibration result is discussed and shown to outperform the state-of-the-art method.