Wim Abbeloos

h-index: 28

14 papers

349 citations

Novelty: 41%

AI Score: 44

Ranked #72,068 of 205,806 authors (top 35%); #25,205 in CV (top 43%)

14 Papers

CV | Dec 11, 2025 | Code

Video Depth Propagation

Luigi Piccinelli, Thiemo Wandel, Christos Sakaridis et al.

Depth estimation in videos is essential for visual perception in real-world applications. However, existing methods either rely on simple frame-by-frame monocular models, leading to temporal inconsistencies and inaccuracies, or use computationally demanding temporal modeling, unsuitable for real-time applications. These limitations significantly restrict general applicability and performance in practical settings. To address this, we propose VeloDepth, an efficient and robust online video depth estimation pipeline that effectively leverages spatiotemporal priors from previous depth predictions and performs deep feature propagation. Our method introduces a novel Propagation Module that refines and propagates depth features and predictions using flow-based warping coupled with learned residual corrections. In addition, our design structurally enforces temporal consistency, resulting in stable depth predictions across consecutive frames with improved efficiency. Comprehensive zero-shot evaluation on multiple benchmarks demonstrates the state-of-the-art temporal consistency and competitive accuracy of VeloDepth, alongside its significantly faster inference compared to existing video-based depth estimators. VeloDepth thus provides a practical, efficient, and accurate solution for real-time depth estimation suitable for diverse perception tasks. Code and models are available at https://github.com/lpiccinelli-eth/velodepth
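
The idea of flow-based warping followed by a residual correction can be illustrated with a toy sketch. This is not the VeloDepth implementation, which per the abstract propagates deep features and uses a learned module. The function names, the nearest-neighbour sampling, the border clamping, and the flow convention (offset from a current pixel to its source in the previous frame) are all assumptions made for illustration, and the residual is a plain input standing in for a learned correction.

```python
from typing import List

Grid = List[List[float]]


def warp_depth(prev_depth: Grid, flow_x: Grid, flow_y: Grid) -> Grid:
    """Backward-warp the previous depth map into the current frame.

    flow_x[r][c] and flow_y[r][c] are assumed to give the pixel offset from
    current pixel (r, c) to its source location in the previous frame.
    Sampling is nearest-neighbour and out-of-bounds sources are clamped.
    """
    h, w = len(prev_depth), len(prev_depth[0])
    warped = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            src_r = min(max(int(round(r + flow_y[r][c])), 0), h - 1)
            src_c = min(max(int(round(c + flow_x[r][c])), 0), w - 1)
            warped[r][c] = prev_depth[src_r][src_c]
    return warped


def propagate_depth(prev_depth: Grid, flow_x: Grid, flow_y: Grid,
                    residual: Grid) -> Grid:
    """Warp the previous prediction, then add a per-pixel residual.

    In a real system the residual would come from a network; here it is
    simply passed in (hypothetical stand-in).
    """
    warped = warp_depth(prev_depth, flow_x, flow_y)
    return [[warped[r][c] + residual[r][c] for c in range(len(warped[0]))]
            for r in range(len(warped))]


if __name__ == "__main__":
    prev = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    fx = [[-1.0] * 3 for _ in range(2)]   # every pixel came from one column to the left
    fy = [[0.0] * 3 for _ in range(2)]
    res = [[0.5] * 3 for _ in range(2)]
    print(propagate_depth(prev, fx, fy, res))
```

Because each new prediction starts from the warped previous one, consecutive frames can only differ by the residual, which is one simple way to see why such a design tends to favour temporal stability.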

CV | Feb 27, 2025 | Code

UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler

Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang et al.

Accurate monocular metric depth estimation (MMDE) is crucial to solving downstream tasks in 3D perception and modeling. However, the remarkable accuracy of recent MMDE methods is confined to their training domains. These methods fail to generalize to unseen domains even in the presence of moderate domain gaps, which hinders their practical applicability. We propose a new model, UniDepthV2, capable of reconstructing metric 3D scenes from solely single images across domains. Departing from the existing MMDE paradigm, UniDepthV2 directly predicts metric 3D points from the input image at inference time without any additional information, striving for a universal and flexible MMDE solution. In particular, UniDepthV2 implements a self-promptable camera module predicting a dense camera representation to condition depth features. Our model exploits a pseudo-spherical output representation, which disentangles the camera and depth representations. In addition, we propose a geometric invariance loss that promotes the invariance of camera-prompted depth features. UniDepthV2 improves its predecessor UniDepth model via a new edge-guided loss which enhances the localization and sharpness of edges in the metric depth outputs, a revisited, simplified and more efficient architectural design, and an additional uncertainty-level output which enables downstream tasks requiring confidence. Thorough evaluations on ten depth datasets in a zero-shot regime consistently demonstrate the superior performance and generalization of UniDepthV2. Code and models are available at https://github.com/lpiccinelli-eth/UniDepth

CV | Mar 20, 2025 | Code

UniK3D: Universal Camera Monocular 3D Estimation

Luigi Piccinelli, Christos Sakaridis, Mattia Segu et al.

Monocular 3D estimation is crucial for visual perception. However, current methods fall short by relying on oversimplified assumptions, such as pinhole camera models or rectified images. These limitations severely restrict their general applicability, causing poor performance in real-world scenarios with fisheye or panoramic images and resulting in substantial context loss. To address this, we present UniK3D, the first generalizable method for monocular 3D estimation able to model any camera. Our method introduces a spherical 3D representation which allows for better disentanglement of camera and scene geometry and enables accurate metric 3D reconstruction for unconstrained camera models. Our camera component features a novel, model-independent representation of the pencil of rays, achieved through a learned superposition of spherical harmonics. We also introduce an angular loss, which, together with the camera module design, prevents the contraction of the 3D outputs for wide-view cameras. A comprehensive zero-shot evaluation on 13 diverse datasets demonstrates the state-of-the-art performance of UniK3D across 3D, depth, and camera metrics, with substantial gains in challenging large-field-of-view and panoramic settings, while maintaining top accuracy in conventional pinhole small-field-of-view domains. Code and models are available at github.com/lpiccinelli-eth/unik3d.

CV | Sep 11, 2025 | Code

DGFusion: Depth-Guided Sensor Fusion for Robust Semantic Perception

Tim Broedermann, Christos Sakaridis, Luigi Piccinelli et al.

Robust semantic perception for autonomous vehicles relies on effectively combining multiple sensors with complementary strengths and weaknesses. State-of-the-art sensor fusion approaches to semantic perception often treat sensor data uniformly across the spatial extent of the input, which hinders performance when faced with challenging conditions. By contrast, we propose a novel depth-guided multimodal fusion method that upgrades condition-aware fusion by integrating depth information. Our network, DGFusion, poses multimodal segmentation as a multi-task problem, utilizing the lidar measurements, which are typically available in outdoor sensor suites, both as one of the model's inputs and as ground truth for learning depth. Our corresponding auxiliary depth head helps to learn depth-aware features, which are encoded into spatially varying local depth tokens that condition our attentive cross-modal fusion. Together with a global condition token, these local depth tokens dynamically adapt sensor fusion to the spatially varying reliability of each sensor across the scene, which largely depends on depth. In addition, we propose a robust loss for our depth, which is essential for learning from lidar inputs that are typically sparse and noisy in adverse conditions. Our method achieves state-of-the-art panoptic and semantic segmentation performance on the challenging MUSES and DELIVER datasets. Code and models will be available at https://github.com/timbroed/DGFusion

CV | Oct 1, 2021

MonoCInIS: Camera Independent Monocular 3D Object Detection using Instance Segmentation

Jonas Heylen, Mark De Wolf, Bruno Dawagne et al.

Monocular 3D object detection has recently shown promising results; however, challenging problems remain. One of those is the lack of invariance to different camera intrinsic parameters, which can be observed across different 3D object datasets. Little effort has been made to exploit the combination of heterogeneous 3D object datasets. In contrast to general intuition, we show that more data does not automatically guarantee better performance, but rather, methods need to have a degree of 'camera independence' in order to benefit from large and heterogeneous training data. In this paper we propose a category-level pose estimation method based on instance segmentation, using camera independent geometric reasoning to cope with the varying camera viewpoints and intrinsics of different datasets. Every pixel of an instance predicts the object dimensions, the 3D object reference points projected in 2D image space and, optionally, the local viewing angle. Camera intrinsics are only used outside of the learned network to lift the predicted 2D reference points to 3D. We surpass camera independent methods on the challenging KITTI3D benchmark and show the key benefits compared to camera dependent methods.
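
To make the lifting step concrete, the sketch below applies the standard pinhole back-projection to a single 2D point. It is a generic illustration rather than the MonoCInIS procedure: the abstract does not say how depth is obtained, so the depth is simply assumed to be known here, and the function name and parameter names (`fx`, `fy`, `cx`, `cy`) are illustrative choices.

```python
from typing import Tuple


def lift_to_3d(u: float, v: float, depth: float,
               fx: float, fy: float, cx: float, cy: float
               ) -> Tuple[float, float, float]:
    """Back-project pixel (u, v) at a known depth using pinhole intrinsics.

    fx, fy are focal lengths in pixels and (cx, cy) is the principal point.
    The returned point is in the camera frame (x right, y down, z forward).
    """
    if fx == 0 or fy == 0:
        raise ValueError("focal lengths must be non-zero")
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


if __name__ == "__main__":
    # A point 50 px right of the principal point, 10 m away, fx = 500 px.
    print(lift_to_3d(u=650.0, v=200.0, depth=10.0,
                     fx=500.0, fy=500.0, cx=600.0, cy=200.0))
```

Keeping this step outside the network means the same learned predictor can be paired with different intrinsics at test time, which matches the camera-independence argument made in the abstract.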

CV | Apr 27, 2021

ACDC: The Adverse Conditions Dataset with Correspondences for Robust Semantic Driving Scene Perception

Christos Sakaridis, Haoran Wang, Ke Li et al.

Level-5 driving automation requires a robust visual perception system that can parse input images under any condition. However, existing driving datasets for dense semantic perception are either dominated by images captured under normal conditions or are small in scale. To address this, we introduce ACDC, the Adverse Conditions Dataset with Correspondences for training and testing methods for diverse semantic perception tasks on adverse visual conditions. ACDC consists of a large set of 8012 images, half of which (4006) are equally distributed between four common adverse conditions: fog, nighttime, rain, and snow. Each adverse-condition image comes with a high-quality pixel-level panoptic annotation, a corresponding image of the same scene under normal conditions, and a binary mask that distinguishes between intra-image regions of clear and uncertain semantic content. 1503 of the corresponding normal-condition images feature panoptic annotations, raising the total annotated images to 5509. ACDC supports the standard tasks of semantic segmentation, object detection, instance segmentation, and panoptic segmentation, as well as the newly introduced uncertainty-aware semantic segmentation. A detailed empirical study demonstrates the challenges that the adverse domains of ACDC pose to state-of-the-art supervised and unsupervised approaches and indicates the value of our dataset in steering future progress in the field. Our dataset and benchmark are publicly available at https://acdc.vision.ee.ethz.ch

CV | Oct 17, 2017

3D Object Discovery and Modeling Using Single RGB-D Images Containing Multiple Object Instances

Wim Abbeloos, Esra Ataer-Cansizoglu, Sergio Caccamo et al.

Unsupervised object modeling is important in robotics, especially for handling a large set of objects. We present a method for unsupervised 3D object discovery, reconstruction, and localization that exploits multiple instances of an identical object contained in a single RGB-D image. The proposed method does not rely on segmentation, scene knowledge, or user input, and thus is easily scalable. Our method aims to find recurrent patterns in a single RGB-D image by utilizing appearance and geometry of the salient regions. We extract keypoints and match them in pairs based on their descriptors. We then generate triplets of the keypoints matching with each other using several geometric criteria to minimize false matches. The relative poses of the matched triplets are computed and clustered to discover sets of triplet pairs with similar relative poses. Triplets belonging to the same set are likely to belong to the same object and are used to construct an initial object model. Detecting the remaining instances with the initial object model using RANSAC allows the model to be further expanded and refined. The automatically generated object models are both compact and descriptive. We show quantitative and qualitative results on RGB-D images with various objects including some from the Amazon Picking Challenge. We also demonstrate the use of our method in an object picking scenario with a robotic arm.

CV | Jul 23, 2017

Detecting and Grouping Identical Objects for Region Proposal and Classification

Wim Abbeloos, Sergio Caccamo, Esra Ataer-Cansizoglu et al.

Often multiple instances of an object occur in the same scene, for example in a warehouse. Unsupervised multi-instance object discovery algorithms are able to detect and identify such objects. We use such an algorithm to provide object proposals to a convolutional neural network (CNN) based classifier. This results in fewer regions to evaluate, compared to traditional region proposal algorithms. Additionally, it enables using the joint probability of multiple instances of an object, resulting in improved classification accuracy. The proposed technique can also split a single class into multiple sub-classes corresponding to the different object types, enabling hierarchical classification.
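
One simple way to see why grouping identical instances can help classification is to combine the per-instance class probabilities. The sketch below multiplies them and renormalizes, which assumes the instances are conditionally independent and the class prior is uniform. The abstract does not state how the joint probability is actually computed, so this combination rule and the function name are assumptions for illustration only.

```python
from math import prod
from typing import List, Sequence


def joint_class_probabilities(instance_probs: Sequence[Sequence[float]]
                              ) -> List[float]:
    """Combine class probabilities of instances known to share one class.

    Each row holds one instance's probabilities over the same classes.
    Rows are multiplied per class and the result is renormalized
    (independence and a uniform prior are assumed).
    """
    if not instance_probs:
        raise ValueError("need at least one instance")
    n_classes = len(instance_probs[0])
    joint = [prod(p[k] for p in instance_probs) for k in range(n_classes)]
    total = sum(joint)
    if total == 0:
        raise ValueError("all classes have zero joint probability")
    return [j / total for j in joint]


if __name__ == "__main__":
    # Two instances of the same object, each only mildly confident.
    print(joint_class_probabilities([[0.6, 0.4], [0.6, 0.4]]))
```

Under these assumptions, two instances that each favour the first class at 0.6 yield a combined score of 0.36 / 0.52, roughly 0.69, so agreement between instances sharpens the decision.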

CV | Jul 23, 2017

Team Applied Robotics: A closer look at our robotic picking system

Wim Abbeloos, Fabian Gouwens, Simon Jansen et al.

This paper describes the vision-based robotic picking system that was developed by our team, Team Applied Robotics, for the Amazon Picking Challenge 2016. This competition challenged teams to develop a robotic system that is able to pick a large variety of products from a shelf or a tote. We discuss the design considerations and our strategy, the high-resolution 3D vision system, the use of a combination of texture- and shape-based object detection algorithms, and the robot path planning and object manipulators that were developed.

CV | Dec 7, 2016

Exploring the potential of combining time of flight and thermal infrared cameras for person detection

Wim Abbeloos, Toon Goedemé

Combining new, low-cost thermal infrared and time-of-flight range sensors provides new opportunities. In this position paper we explore the possibilities of combining these sensors and using their fused data for person detection. The proposed calibration approach for this sensor combination differs from the traditional stereo camera calibration in two fundamental ways. A first distinction is that the spectral sensitivity of the two sensors differs significantly. In fact, there is no sensitivity range overlap at all. A second distinction is that their resolution is typically very low, which requires special attention. We assume a situation in which the sensors' relative position is known, but their orientation is unknown. In addition, some of the typical measurement errors are discussed, and methods to compensate for them are proposed. We discuss how the fused data could allow increased accuracy and robustness without the need for complex algorithms requiring large amounts of computational power and training data.

CV | Dec 7, 2016

Process Monitoring of Extrusion Based 3D Printing via Laser Scanning

Matthias Faes, Wim Abbeloos, Frederik Vogeler et al.

Extrusion based 3D Printing (E3DP) is an Additive Manufacturing (AM) technique that extrudes thermoplastic polymer in order to build up components using a layerwise approach. AM typically requires long production times in comparison to mass production processes such as Injection Molding. Failures during the AM process are often only noticed after build completion and frequently lead to part rejection because of dimensional inaccuracy or lack of mechanical performance, resulting in a significant loss of time and material. A solution to improve the accuracy and robustness of a manufacturing technology is the integration of sensors to monitor and control process state-variables online. In this way, errors can be rapidly detected and possibly compensated for at an early stage. To achieve this, we integrated a modular 2D laser triangulation scanner into an E3DP machine and analyzed feedback signals. A 2D laser triangulation scanner was selected here owing to its very compact size, achievable accuracy and the possibility of capturing geometrical 3D data. Thus, our implemented system is able to provide both quantitative and qualitative information. Also, in this work, first steps towards the development of a quality control loop for E3DP processes are presented and opportunities are discussed.

CV | Dec 7, 2016

Embedded Line Scan Image Sensors: The Low Cost Alternative for High Speed Imaging

Stef Van Wolputte, Wim Abbeloos, Stijn Helsen et al.

In this paper we propose a low-cost high-speed imaging line scan system. We replace an expensive industrial line scan camera and illumination with a custom-built set-up of cheap off-the-shelf components, yielding a measurement system with comparable quality while costing about 20 times less. We use a low-cost linear (1D) image sensor, cheap optics including LED-based or laser-based lighting, and an embedded platform to process the images. A step-by-step method to design such a custom high-speed imaging system and select proper components is proposed. Simulations that predict the final image quality obtainable with the set-up have been developed. Finally, we applied our method in a lab setting that closely represents real-life cases. Our results show that our simulations are very accurate and that our low-cost line scan set-up achieves image quality comparable to that of the high-end commercial vision system, for a fraction of the price.

CV | Dec 7, 2016

Fusion of Range and Thermal Images for Person Detection

Wim Abbeloos, Toon Goedemé

Detecting people in images is a challenging problem. Differences in pose, clothing and lighting, along with other factors, cause a lot of variation in their appearance. To overcome these issues, we propose a system based on fused range and thermal infrared images. These measurements show considerably less variation and provide more meaningful information. We provide a brief introduction to the sensor technology used and propose a calibration method. Several data fusion algorithms are compared and their performance is assessed on a simulated data set. The results of initial experiments on real data are analyzed and the measurement errors and the challenges they present are discussed. The resulting fused data are used to efficiently detect people in a fixed camera set-up. The system is extended to include person tracking.

CV | Dec 5, 2016

Point Pair Feature based Object Detection for Random Bin Picking

Wim Abbeloos, Toon Goedemé

Point pair features are a popular representation for free-form 3D object detection and pose estimation. In this paper, their performance in an industrial random bin picking context is investigated. A new method to generate representative synthetic datasets is proposed. This allows us to investigate the influence of a high degree of clutter and the presence of self-similar features, which are typical of our application. We provide an overview of solutions proposed in the literature and discuss their strengths and weaknesses. A simple heuristic method to drastically reduce the computational complexity is introduced, which results in improved robustness, speed and accuracy compared to the naive approach.
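
For readers unfamiliar with the representation, the sketch below computes a point pair feature in the four-component form commonly used in the literature: the distance between two oriented points plus three angles involving their normals. The abstract does not spell out the exact variant used in the paper, so this definition, the function names, and the choice of radians are assumptions for illustration.

```python
import math
from typing import Sequence, Tuple

Vec = Sequence[float]


def _dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def _norm(a: Vec) -> float:
    return math.sqrt(_dot(a, a))


def _angle(a: Vec, b: Vec) -> float:
    """Angle between two non-zero vectors, in radians in [0, pi]."""
    cos = _dot(a, b) / (_norm(a) * _norm(b))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding


def point_pair_feature(p1: Vec, n1: Vec, p2: Vec, n2: Vec
                       ) -> Tuple[float, float, float, float]:
    """Return (|d|, angle(n1, d), angle(n2, d), angle(n1, n2)), d = p2 - p1.

    p1, p2 are 3D points and n1, n2 their surface normals.
    """
    d = [b - a for a, b in zip(p1, p2)]
    dist = _norm(d)
    if dist == 0:
        raise ValueError("points must be distinct")
    return (dist, _angle(n1, d), _angle(n2, d), _angle(n1, n2))


if __name__ == "__main__":
    # Two points on a flat surface, 1 unit apart, both normals pointing up.
    print(point_pair_feature((0, 0, 0), (0, 0, 1), (1, 0, 0), (0, 0, 1)))
```

A naive matcher evaluates this feature for every pair of model and scene points, which grows quadratically with the number of points; that cost is what motivates heuristics for reducing the number of pairs.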