Oliver Wasenmüller

h-index13

37papers

564citations

Novelty44%

AI Score51

Ranked #16,774 of 194,257 authors (top 9%)#6,092 in CV (top 10%)

37 Papers

2.8CVAug 18, 2023

Transformer-based Detection of Microorganisms on High-Resolution Petri Dish Images

Nikolas Ebert, Didier Stricker, Oliver Wasenmüller

Many medical or pharmaceutical processes have strict guidelines regarding continuous hygiene monitoring. This often involves the labor-intensive task of manually counting microorganisms in Petri dishes by trained personnel. Automation attempts often struggle due to major challenges: significant scaling differences, low separation, low contrast, etc. To address these challenges, we introduce AttnPAFPN, a high-resolution detection pipeline that leverages a novel transformer variation, the efficient-global self-attention mechanism. Our streamlined approach can be easily integrated in almost any multi-scale object detection pipeline. In a comprehensive evaluation on the publicly available AGAR dataset, we demonstrate the superior accuracy of our network over the current state-of-the-art. In order to demonstrate the task-independent performance of our approach, we perform further experiments on COCO and LIVECell datasets.

3.9CVJul 18, 2023

Light-Weight Vision Transformer with Parallel Local and Global Self-Attention

Nikolas Ebert, Laurenz Reichardt, Didier Stricker et al.

While transformer architectures have dominated computer vision in recent years, these models cannot easily be deployed on hardware with limited resources for autonomous driving tasks that require real-time-performance. Their computational complexity and memory requirements limits their use, especially for applications with high-resolution inputs. In our work, we redesign the powerful state-of-the-art Vision Transformer PLG-ViT to a much more compact and efficient architecture that is suitable for such tasks. We identify computationally expensive blocks in the original PLG-ViT architecture and propose several redesigns aimed at reducing the number of parameters and floating-point operations. As a result of our redesign, we are able to reduce PLG-ViT in size by a factor of 5, with a moderate drop in performance. We propose two variants, optimized for the best trade-off between parameter count to runtime as well as parameter count to accuracy. With only 5 million parameters, we achieve 79.5$\%$ top-1 accuracy on the ImageNet-1K classification benchmark. Our networks demonstrate great performance on general vision benchmarks like COCO instance segmentation. In addition, we conduct a series of experiments, demonstrating the potential of our approach in solving various tasks specifically tailored to the challenges of autonomous driving and transportation.

5.7CVMay 3, 2022

Multitask Network for Joint Object Detection, Semantic Segmentation and Human Pose Estimation in Vehicle Occupancy Monitoring

Nikolas Ebert, Patrick Mangat, Oliver Wasenmüller

In order to ensure safe autonomous driving, precise information about the conditions in and around the vehicle must be available. Accordingly, the monitoring of occupants and objects inside the vehicle is crucial. In the state-of-the-art, single or multiple deep neural networks are used for either object recognition, semantic segmentation, or human pose estimation. In contrast, we propose our Multitask Detection, Segmentation and Pose Estimation Network (MDSP) -- the first multitask network solving all these three tasks jointly in the area of occupancy monitoring. Due to the shared architecture, memory and computing costs can be saved while achieving higher accuracy. Furthermore, our architecture allows a flexible combination of the three mentioned tasks during a simple end-to-end training. We perform comprehensive evaluations on the public datasets SVIRO and TiCaM in order to demonstrate the superior performance.

10.4CVSep 12, 2023

360$^\circ$ from a Single Camera: A Few-Shot Approach for LiDAR Segmentation

Laurenz Reichardt, Nikolas Ebert, Oliver Wasenmüller

Deep learning applications on LiDAR data suffer from a strong domain gap when applied to different sensors or tasks. In order for these methods to obtain similar accuracy on different data in comparison to values reported on public benchmarks, a large scale annotated dataset is necessary. However, in practical applications labeled data is costly and time consuming to obtain. Such factors have triggered various research in label-efficient methods, but a large gap remains to their fully-supervised counterparts. Thus, we propose ImageTo360, an effective and streamlined few-shot approach to label-efficient LiDAR segmentation. Our method utilizes an image teacher network to generate semantic predictions for LiDAR data within a single camera view. The teacher is used to pretrain the LiDAR segmentation student network, prior to optional fine-tuning on 360$^\circ$ data. Our method is implemented in a modular manner on the point level and as such is generalizable to different architectures. We improve over the current state-of-the-art results for label-efficient methods and even surpass some traditional fully-supervised segmentation networks.

4.4CVMay 13

Generative Texture Diversification of 3D Pedestrians for Robust Autonomous Driving Perception

Arka Bhowmick, Enes Ozeren, Ahmed Abdullah et al.

In recent years, autonomous driving has significantly in creased the demand for high-quality data to train 2D and 3D perception models for safety-critical scenarios. Real world datasets struggle to meet this demand as require ments continuously evolve and large-scale annotated data collection remains costly and time-consuming making syn thetic data a scalable, practical and controllable alterna tive. Pedestrian detection is among the most safety-critical tasks in autonomous driving. In this paper, we propose a simple yet effective method for scaling variability in 3D pedestrian assets for synthetic scene generation. Starting from a single 3D base asset, we generate multiple distinct pedestrian instances by synthesizing diverse facial textures and identity-level appearance variations using StyleGAN2 and automatically mapping them onto 3D meshes. This ap proach enables scalable appearance-level asset diversifica tion without requiring the design of new geometries for each instance. Using the assets, we construct synthetic datasets and study the impact of mixing real and synthetic data for RGB-based object detection. Through complementary ex periments, we analyze geometry-driven distribution shifts in point cloud perception for 3D object detection. Our findings demonstrate that controlled synthetic diversifica tion improves robustness in 2D detection while revealing the sensitivity of 3D perception models to geometric domain gaps. Overall, this work highlights how generative AI en ables scalable, simulation-ready pedestrian diversification through controlled facial texture synthesis, along with the benefits and limitations of cross-domain training strategies in autonomous driving pipelines.

5.5CVApr 27

PointTransformerX:Portable and Efficient 3D Point Cloud Processing without Sparse Algorithms

Laurenz Reichardt, Nikolas Ebert, Oliver Wasenmüller

3D point cloud perception remains tightly coupled to custom CUDA operators for spatial operations, limiting portability and efficiency on non-NVIDIA, AMD, and embedded hardware. We introduce PointTransformerX (PTX), a fully PyTorch-native vision transformer backbone for 3D point clouds, removing all custom CUDA operators and external libraries while retaining competitive accuracy. PTX introduces 3D-GS-RoPE, a rotary positional embedding that encodes 3D spatial relationships directly in self-attention without neighborhood construction, and further replaces sparse convolutional patch embedding with a linear projection. PTX explores inference-time scaling of attention windows to improve accuracy without retraining. With a redesigned feed-forward network, PTX achieves 98.7\% of PointTransformer V3's accuracy on ScanNet with 79.2\% fewer parameters and executing 1.6\times faster while requiring just 253 MB memory. PTX runs natively on NVIDIA GPUs, AMD GPUs (ROCm), and CPUs, providing an efficient and portable foundation for point cloud perception.

6.0CVApr 29

TAP into the Patch Tokens: Leveraging Vision Foundation Model Features for AI-Generated Image Detection

Ahmed Abdullah, Nikolas Ebert, Oliver Wasenmüller

Recent methods demonstrate that large-scale pretrained models, such as CLIP vision transformers, effectively detect AI-generated images (AIGIs) from unseen generative models when used as feature extractors. Many state-of-the-art methods for AI-generated image detection build upon the original CLIP-ViT to enhance this generalization. Since CLIP's release, numerous vision foundation models (VFMs) have emerged, incorporating architectural improvements and different training paradigms. Despite these advances, their potential for AIGI detection and AI image forensics remains largely unexplored. In this work, we present a comprehensive benchmark across multiple VFM families, covering diverse pretraining objectives, input resolutions, and model scales. We systematically evaluate their out-of-the-box performance for detecting fully-generated AI-images and AI-inpainted images, and discover that the best model outperforms the original CLIP by more than 12% in accuracy, beating established approaches in the process. To fully leverage the features of a modern VFM, we propose a simple redesign of the classifier head by utilizing tunable attention pooling (TAP), which aggregates output tokens into a refined global representation. Integrating TAP with the latest VFMs yields substantial performance gains across several AIGI detection benchmarks, establishing a new state-of-the-art on two challenging benchmarks for in-the-wild detection of AI-generated and -inpainted images.

4.7CVApr 17

SSFT: A Lightweight Spectral-Spatial Fusion Transformer for Generic Hyperspectral Classification

Alexander Musiat, Nikolas Ebert, Oliver Wasenmüller

Hyperspectral imaging enables fine-grained recognition of materials by capturing rich spectral signatures, but learning robust classifiers is challenging due to high dimensionality, spectral redundancy, limited labeled data, and strong domain shifts. Beyond earth observation, labeled HSI data is often scarce and imbalanced, motivating compact models for generic hyperspectral classification across diverse acquisition regimes. We propose the lightweight Spectral-Spatial Fusion Transformer (SSFT), which factorizes representation learning into spectral and spatial pathways and integrates them via cross-attention to capture complementary wavelength-dependent and structural information. We evaluate our SSFT on the challenging HSI-Benchmark, a heterogeneous multi-dataset benchmark covering earth observation, fruit condition assessment, and fine-grained material recognition. SSFT achieves state-of-the-art overall performance, ranking first while using less than 2% of the parameters of the previous leading method. We further evaluate transfer to the substantially larger SpectralEarth benchmark under the official protocol, where SSFT remains competitive despite its compact size. Ablation studies show that both spectral and spatial pathways are crucial, with spatial modeling contributing most, and that SSFT remains robust without data augmentation.

7.6CVAug 26, 2024Code

GenFormer -- Generated Images are All You Need to Improve Robustness of Transformers on Small Datasets

Sven Oehri, Nikolas Ebert, Ahmed Abdullah et al.

Recent studies showcase the competitive accuracy of Vision Transformers (ViTs) in relation to Convolutional Neural Networks (CNNs), along with their remarkable robustness. However, ViTs demand a large amount of data to achieve adequate performance, which makes their application to small datasets challenging, falling behind CNNs. To overcome this, we propose GenFormer, a data augmentation strategy utilizing generated images, thereby improving transformer accuracy and robustness on small-scale image classification tasks. In our comprehensive evaluation we propose Tiny ImageNetV2, -R, and -A as new test set variants of Tiny ImageNet by transferring established ImageNet generalization and robustness benchmarks to the small-scale data domain. Similarly, we introduce MedMNIST-C and EuroSAT-C as corrupted test set variants of established fine-grained datasets in the medical and aerial domain. Through a series of experiments conducted on small datasets of various domains, including Tiny ImageNet, CIFAR, EuroSAT and MedMNIST datasets, we demonstrate the synergistic power of our method, in particular when combined with common train and test time augmentations, knowledge distillation, and architectural design choices. Additionally, we prove the effectiveness of our approach under challenging conditions with limited training data, demonstrating significant improvements in both accuracy and robustness, bridging the gap between CNNs and ViTs in the small-scale dataset domain.

6.5CVAug 26, 2024Code

Text3DAug -- Prompted Instance Augmentation for LiDAR Perception

Laurenz Reichardt, Luca Uhr, Oliver Wasenmüller

LiDAR data of urban scenarios poses unique challenges, such as heterogeneous characteristics and inherent class imbalance. Therefore, large-scale datasets are necessary to apply deep learning methods. Instance augmentation has emerged as an efficient method to increase dataset diversity. However, current methods require the time-consuming curation of 3D models or costly manual data annotation. To overcome these limitations, we propose Text3DAug, a novel approach leveraging generative models for instance augmentation. Text3DAug does not depend on labeled data and is the first of its kind to generate instances and annotations from text. This allows for a fully automated pipeline, eliminating the need for manual effort in practical applications. Additionally, Text3DAug is sensor agnostic and can be applied regardless of the LiDAR sensor used. Comprehensive experimental analysis on LiDAR segmentation, detection and novel class discovery demonstrates that Text3DAug is effective in supplementing existing methods or as a standalone method, performing on par or better than established methods, however while overcoming their specific drawbacks. The code is publicly available.

13.2CVAug 18, 2020Code

DeepLiDARFlow: A Deep Learning Architecture For Scene Flow Estimation Using Monocular Camera and Sparse LiDAR

Rishav, Ramy Battrawy, René Schuster et al.

Scene flow is the dense 3D reconstruction of motion and geometry of a scene. Most state-of-the-art methods use a pair of stereo images as input for full scene reconstruction. These methods depend a lot on the quality of the RGB images and perform poorly in regions with reflective objects, shadows, ill-conditioned light environment and so on. LiDAR measurements are much less sensitive to the aforementioned conditions but LiDAR features are in general unsuitable for matching tasks due to their sparse nature. Hence, using both LiDAR and RGB can potentially overcome the individual disadvantages of each sensor by mutual improvement and yield robust features which can improve the matching process. In this paper, we present DeepLiDARFlow, a novel deep learning architecture which fuses high level RGB and LiDAR features at multiple scales in a monocular setup to predict dense scene flow. Its performance is much better in the critical regions where image-only and LiDAR-only methods are inaccurate. We verify our DeepLiDARFlow using the established data sets KITTI and FlyingThings3D and we show strong robustness compared to several state-of-the-art methods which used other input modalities. The code of our paper is available at https://github.com/dfki-av/DeepLiDARFlow.

3.6CVJan 27, 2025

D-PLS: Decoupled Semantic Segmentation for 4D-Panoptic-LiDAR-Segmentation

Maik Steinhauser, Laurenz Reichardt, Nikolas Ebert et al.

This paper introduces a novel approach to 4D Panoptic LiDAR Segmentation that decouples semantic and instance segmentation, leveraging single-scan semantic predictions as prior information for instance segmentation. Our method D-PLS first performs single-scan semantic segmentation and aggregates the results over time, using them to guide instance segmentation. The modular design of D-PLS allows for seamless integration on top of any semantic segmentation architecture, without requiring architectural changes or retraining. We evaluate our approach on the SemanticKITTI dataset, where it demonstrates significant improvements over the baseline in both classification and association tasks, as measured by the LiDAR Segmentation and Tracking Quality (LSTQ) metric. Furthermore, we show that our decoupled architecture not only enhances instance prediction but also surpasses the baseline due to advancements in single-scan semantic segmentation.

3.6CVJan 17, 2025

Classifier Ensemble for Efficient Uncertainty Calibration of Deep Neural Networks for Image Classification

Michael Schulze, Nikolas Ebert, Laurenz Reichardt et al.

This paper investigates novel classifier ensemble techniques for uncertainty calibration applied to various deep neural networks for image classification. We evaluate both accuracy and calibration metrics, focusing on Expected Calibration Error (ECE) and Maximum Calibration Error (MCE). Our work compares different methods for building simple yet efficient classifier ensembles, including majority voting and several metamodel-based approaches. Our evaluation reveals that while state-of-the-art deep neural networks for image classification achieve high accuracy on standard datasets, they frequently suffer from significant calibration errors. Basic ensemble techniques like majority voting provide modest improvements, while metamodel-based ensembles consistently reduce ECE and MCE across all architectures. Notably, the largest of our compared metamodels demonstrate the most substantial calibration improvements, with minimal impact on accuracy. Moreover, classifier ensembles with metamodels outperform traditional model ensembles in calibration performance, while requiring significantly fewer parameters. In comparison to traditional post-hoc calibration methods, our approach removes the need for a separate calibration dataset. These findings underscore the potential of our proposed metamodel-based classifier ensembles as an efficient and effective approach to improving model calibration, thereby contributing to more reliable deep learning systems.

2.6CVOct 21, 2021

Detection of Driver Drowsiness by Calculating the Speed of Eye Blinking

Muhammad Fawwaz Yusri, Patrick Mangat, Oliver Wasenmüller

Many road accidents are caused by drowsiness of the driver. While there are methods to detect closed eyes, it is a non-trivial task to detect the gradual process of a driver becoming drowsy. We consider a simple real-time detection system for drowsiness merely based on the eye blinking rate derived from the eye aspect ratio. For the eye detection we use HOG and a linear SVM. If the speed of the eye blinking drops below some empirically determined threshold, the system triggers an alarm, hence preventing the driver from falling into microsleep. In this paper, we extensively evaluate the minimal requirements for the proposed system. We find that this system works well if the face is directed to the camera, but it becomes less reliable once the head is tilted significantly. The results of our evaluations provide the foundation for further developments of our drowsiness detection system.

4.7CVJul 14, 2021

PDC: Piecewise Depth Completion utilizing Superpixels

Dennis Teutscher, Patrick Mangat, Oliver Wasenmüller

Depth completion from sparse LiDAR and high-resolution RGB data is one of the foundations for autonomous driving techniques. Current approaches often rely on CNN-based methods with several known drawbacks: flying pixel at depth discontinuities, overfitting to both a given data set as well as error metric, and many more. Thus, we propose our novel Piecewise Depth Completion (PDC), which works completely without deep learning. PDC segments the RGB image into superpixels corresponding the regions with similar depth value. Superpixels corresponding to same objects are gathered using a cost map. At the end, we receive detailed depth images with state of the art accuracy. In our evaluation, we can show both the influence of the individual proposed processing steps and the overall performance of our method on the challenging KITTI dataset.

3.7CVJul 14, 2021

DVMN: Dense Validity Mask Network for Depth Completion

Laurenz Reichardt, Patrick Mangat, Oliver Wasenmüller

LiDAR depth maps provide environmental guidance in a variety of applications. However, such depth maps are typically sparse and insufficient for complex tasks such as autonomous navigation. State of the art methods use image guided neural networks for dense depth completion. We develop a guided convolutional neural network focusing on gathering dense and valid information from sparse depth maps. To this end, we introduce a novel layer with spatially variant and content-depended dilation to include additional data from sparse input. Furthermore, we propose a sparsity invariant residual bottleneck block. We evaluate our Dense Validity Mask Network (DVMN) on the KITTI depth completion benchmark and achieve state of the art results. At the time of submission, our network is the leading method using sparsity invariant convolution.

4.7CVMay 7, 2021

Autoencoder Based Inter-Vehicle Generalization for In-Cabin Occupant Classification

Steve Dias Da Cruz, Bertram Taetz, Oliver Wasenmüller et al.

Common domain shift problem formulations consider the integration of multiple source domains, or the target domain during training. Regarding the generalization of machine learning models between different car interiors, we formulate the criterion of training in a single vehicle: without access to the target distribution of the vehicle the model would be deployed to, neither with access to multiple vehicles during training. We performed an investigation on the SVIRO dataset for occupant classification on the rear bench and propose an autoencoder based approach to improve the transferability. The autoencoder is on par with commonly used classification models when trained from scratch and sometimes out-performs models pre-trained on a large amount of data. Moreover, the autoencoder can transform images from unknown vehicles into the vehicle it was trained on. These results are corroborated by an evaluation on real infrared images from two vehicle interiors.

12.4CVOct 16, 2020

HPERL: 3D Human Pose Estimation from RGB and LiDAR

Michael Fürst, Shriya T. P. Gupta, René Schuster et al.

In-the-wild human pose estimation has a huge potential for various fields, ranging from animation and action recognition to intention recognition and prediction for autonomous driving. The current state-of-the-art is focused only on RGB and RGB-D approaches for predicting the 3D human pose. However, not using precise LiDAR depth information limits the performance and leads to very inaccurate absolute pose estimation. With LiDAR sensors becoming more affordable and common on robots and autonomous vehicle setups, we propose an end-to-end architecture using RGB and LiDAR to predict the absolute 3D human pose with unprecedented precision. Additionally, we introduce a weakly-supervised approach to generate 3D predictions using 2D pose annotations from PedX [1]. This allows for many new opportunities in the field of 3D human pose estimation.

9.1CVAug 21, 2020

SSGP: Sparse Spatial Guided Propagation for Robust and Generic Interpolation

René Schuster, Oliver Wasenmüller, Christian Unger et al.

Interpolation of sparse pixel information towards a dense target resolution finds its application across multiple disciplines in computer vision. State-of-the-art interpolation of motion fields applies model-based interpolation that makes use of edge information extracted from the target image. For depth completion, data-driven learning approaches are widespread. Our work is inspired by latest trends in depth completion that tackle the problem of dense guidance for sparse information. We extend these ideas and create a generic cross-domain architecture that can be applied for a multitude of interpolation problems like optical flow, scene flow, or depth completion. In our experiments, we show that our proposed concept of Sparse Spatial Guided Propagation (SSGP) achieves improvements to robustness, accuracy, or speed compared to specialized algorithms.

3.3CVJun 22, 2020

ResFPN: Residual Skip Connections in Multi-Resolution Feature Pyramid Networks for Accurate Dense Pixel Matching

Rishav, René Schuster, Ramy Battrawy et al.

Dense pixel matching is required for many computer vision algorithms such as disparity, optical flow or scene flow estimation. Feature Pyramid Networks (FPN) have proven to be a suitable feature extractor for CNN-based dense matching tasks. FPN generates well localized and semantically strong features at multiple scales. However, the generic FPN is not utilizing its full potential, due to its reasonable but limited localization accuracy. Thus, we present ResFPN -- a multi-resolution feature pyramid network with multiple residual skip connections, where at any scale, we leverage the information from higher resolution maps for stronger and better localized features. In our ablation study, we demonstrate the effectiveness of our novel architecture with clearly higher accuracy than FPN. In addition, we verify the superior accuracy of ResFPN in many different pixel matching applications on established datasets like KITTI, Sintel, and FlyingThings3D.

7.9CVJun 17, 2020

LRPD: Long Range 3D Pedestrian Detection Leveraging Specific Strengths of LiDAR and RGB

Michael Fürst, Oliver Wasenmüller, Didier Stricker

While short range 3D pedestrian detection is sufficient for emergency breaking, long range detections are required for smooth breaking and gaining trust in autonomous vehicles. The current state-of-the-art on the KITTI benchmark performs suboptimal in detecting the position of pedestrians at long range. Thus, we propose an approach specifically targeting long range 3D pedestrian detection (LRPD), leveraging the density of RGB and the precision of LiDAR. Therefore, for proposals, RGB instance segmentation and LiDAR point based proposal generation are combined, followed by a second stage using both sensor modalities symmetrically. This leads to a significant improvement in mAP on long range compared to the current state-of-the art. The evaluation of our LRPD approach was done on the pedestrians from the KITTI benchmark.

6.5CVJan 28, 2020

DFKI Cabin Simulator: A Test Platform for Visual In-Cabin Monitoring Functions

Hartmut Feld, Bruno Mirbach, Jigyasa Katrolia et al.

We present a test platform for visual in-cabin scene analysis and occupant monitoring functions. The test platform is based on a driving simulator developed at the DFKI, consisting of a realistic in-cabin mock-up and a wide-angle projection system for a realistic driving experience. The platform has been equipped with a wide-angle 2D/3D camera system monitoring the entire interior of the vehicle mock-up of the simulator. It is also supplemented with a ground truth reference sensor system that allows to track and record the occupant's body movements synchronously with the 2D and 3D video streams of the camera. Thus, the resulting test platform will serve as a basis to validate numerous in-cabin monitoring functions, which are important for the realization of novel human-vehicle interfaces, advanced driver assistant systems, and automated driving. Among the considered functions are occupant presence detection, size and 3D-pose estimation and driver intention recognition. In addition, our platform will be the basis for the creation of large-scale in-cabin benchmark datasets.

13.6CVJan 10, 2020

SVIRO: Synthetic Vehicle Interior Rear Seat Occupancy Dataset and Benchmark

Steve Dias Da Cruz, Oliver Wasenmüller, Hans-Peter Beise et al.

We release SVIRO, a synthetic dataset for sceneries in the passenger compartment of ten different vehicles, in order to analyze machine learning-based approaches for their generalization capacities and reliability when trained on a limited number of variations (e.g. identical backgrounds and textures, few instances per class). This is in contrast to the intrinsically high variability of common benchmark datasets, which focus on improving the state-of-the-art of general tasks. Our dataset contains bounding boxes for object detection, instance segmentation masks, keypoints for pose estimation and depth images for each synthetic scenery as well as images for each individual seat for classification. The advantage of our use-case is twofold: The proximity to a realistic application to benchmark new approaches under novel circumstances while reducing the complexity to a more tractable environment, such that applications and theoretical questions can be tested on a more challenging dataset as toy problems. The data and evaluation server are available under https://sviro.kl.dfki.de.

12.4CVOct 31, 2019

LiDAR-Flow: Dense Scene Flow Estimation from Sparse LiDAR and Stereo Images

Ramy Battrawy, René Schuster, Oliver Wasenmüller et al.

We propose a new approach called LiDAR-Flow to robustly estimate a dense scene flow by fusing a sparse LiDAR with stereo images. We take the advantage of the high accuracy of LiDAR to resolve the lack of information in some regions of stereo images due to textureless objects, shadows, ill-conditioned light environment and many more. Additionally, this fusion can overcome the difficulty of matching unstructured 3D points between LiDAR-only scans. Our LiDAR-Flow approach consists of three main steps; each of them exploits LiDAR measurements. First, we build strong seeds from LiDAR to enhance the robustness of matches between stereo images. The imagery part seeks the motion matches and increases the density of scene flow estimation. Then, a consistency check employs LiDAR seeds to remove the possible mismatches. Finally, LiDAR measurements constraint the edge-preserving interpolation method to fill the remaining gaps. In our evaluation we investigate the individual processing steps of our LiDAR-Flow approach and demonstrate the superior performance compared to image-only approach.

2.6CVJul 31, 2019

Rapid Light Field Depth Estimation with Semi-Global Matching

Yuriy Anisimov, Oliver Wasenmüller, Didier Stricker

Running time of the light field depth estimation algorithms is typically high. This assessment is based on the computational complexity of existing methods and the large amounts of data involved. The aim of our work is to develop a simple and fast algorithm for accurate depth computation. In this context, we propose an approach, which involves Semi-Global Matching for the processing of light field images. It forms on comparison of pixels' correspondences with different metrics in the substantially bounded light field space. We show that our method is suitable for the fast production of a proper result in a variety of light field configurations

3.4CVJul 25, 2019

A Compact Light Field Camera for Real-Time Depth Estimation

Yuriy Anisimov, Oliver Wasenmüller, Didier Stricker

Depth cameras are utilized in many applications. Recently light field approaches are increasingly being used for depth computation. While these approaches demonstrate the technical feasibility, they can not be brought into real-world application, since they have both a high computation time as well as a large design. Exactly these two drawbacks are overcome in this paper. For the first time, we present a depth camera based on the light field principle, which provides real-time depth information as well as a compact design.

5.4CVApr 29, 2019

DeLiO: Decoupled LiDAR Odometry

Queens Maria Thomas, Oliver Wasenmüller, Didier Stricker

Most LiDAR odometry algorithms estimate the transformation between two consecutive frames by estimating the rotation and translation in an intervening fashion. In this paper, we propose our Decoupled LiDAR Odometry (DeLiO), which -- for the first time -- decouples the rotation estimation completely from the translation estimation. In particular, the rotation is estimated by extracting the surface normals from the input point clouds and tracking their characteristic pattern on a unit sphere. Using this rotation the point clouds are unrotated so that the underlying transformation is pure translation, which can be easily estimated using a line cloud approach. An evaluation is performed on the KITTI dataset and the results are compared against state-of-the-art algorithms.

1.8CVApr 12, 2019

An Empirical Evaluation Study on the Training of SDC Features for Dense Pixel Matching

René Schuster, Oliver Wasenmüller, Christian Unger et al.

Training a deep neural network is a non-trivial task. Not only the tuning of hyperparameters, but also the gathering and selection of training data, the design of the loss function, and the construction of training schedules is important to get the most out of a model. In this study, we perform a set of experiments all related to these issues. The model for which different training strategies are investigated is the recently presented SDC descriptor network (stacked dilated convolution). It is used to describe images on pixel-level for dense matching tasks. Our work analyzes SDC in more detail, validates some best practices for training deep neural networks, and provides insights into training with multiple domain data.

12.4CVApr 12, 2019Code

PWOC-3D: Deep Occlusion-Aware End-to-End Scene Flow Estimation

Rohan Saxena, René Schuster, Oliver Wasenmüller et al.

In the last few years, convolutional neural networks (CNNs) have demonstrated increasing success at learning many computer vision tasks including dense estimation problems such as optical flow and stereo matching. However, the joint prediction of these tasks, called scene flow, has traditionally been tackled using slow classical methods based on primitive assumptions which fail to generalize. The work presented in this paper overcomes these drawbacks efficiently (in terms of speed and accuracy) by proposing PWOC-3D, a compact CNN architecture to predict scene flow from stereo image sequences in an end-to-end supervised setting. Further, large motion and occlusions are well-known problems in scene flow estimation. PWOC-3D employs specialized design decisions to explicitly model these challenges. In this regard, we propose a novel self-supervised strategy to predict occlusions from images (learned without any labeled occlusion data). Leveraging several such constructs, our network achieves competitive results on the KITTI benchmark and the challenging FlyingThings3D dataset. Especially on KITTI, PWOC-3D achieves the second place among end-to-end deep learning methods with 48 times fewer parameters than the top-performing method.

7.6CVApr 5, 2019

SDC - Stacked Dilated Convolution: A Unified Descriptor Network for Dense Matching Tasks

René Schuster, Oliver Wasenmüller, Christian Unger et al.

Dense pixel matching is important for many computer vision tasks such as disparity and flow estimation. We present a robust, unified descriptor network that considers a large context region with high spatial variance. Our network has a very large receptive field and avoids striding layers to maintain spatial resolution. These properties are achieved by creating a novel neural network layer that consists of multiple, parallel, stacked dilated convolutions (SDC). Several of these layers are combined to form our SDC descriptor network. In our experiments, we show that our SDC features outperform state-of-the-art feature descriptors in terms of accuracy and robustness. In addition, we demonstrate the superior performance of SDC in state-of-the-art stereo matching, optical flow and scene flow algorithms on several famous public benchmarks.

7.1CVFeb 26, 2019

SceneFlowFields++: Multi-frame Matching, Visibility Prediction, and Robust Interpolation for Scene Flow Estimation

René Schuster, Oliver Wasenmüller, Christian Unger et al.

State-of-the-art scene flow algorithms pursue the conflicting targets of accuracy, run time, and robustness. With the successful concept of pixel-wise matching and sparse-to-dense interpolation, we push the limits of scene flow estimation. Avoiding strong assumptions on the domain or the problem yields a more robust algorithm. This algorithm is fast because we avoid explicit regularization during matching, which allows an efficient computation. Using image information from multiple time steps and explicit visibility prediction based on previous results, we achieve competitive performances on different data sets. Our contributions and results are evaluated in comparative experiments. Overall, we present an accurate scene flow algorithm that is faster and more generic than any individual benchmark leader.

0.9CVAug 30, 2018

Automated Scene Flow Data Generation for Training and Verification

Oliver Wasenmüller, René Schuster, Didier Stricker et al.

Scene flow describes the 3D position as well as the 3D motion of each pixel in an image. Such algorithms are the basis for many state-of-the-art autonomous or automated driving functions. For verification and training large amounts of ground truth data is required, which is not available for real data. In this paper, we demonstrate a technology to create synthetic data with dense and precise scene flow ground truth.

3.3CVAug 30, 2018

Dense Scene Flow from Stereo Disparity and Optical Flow

René Schuster, Oliver Wasenmüller, Didier Stricker

Scene flow describes 3D motion in a 3D scene. It can either be modeled as a single task, or it can be reconstructed from the auxiliary tasks of stereo depth and optical flow estimation. While the second method can achieve real-time performance by using real-time auxiliary methods, it will typically produce non-dense results. In this representation of a basic combination approach for scene flow estimation, we will tackle the problem of non-density by interpolation.

3.9CVJun 20, 2018

Dynamic Risk Assessment for Vehicles of Higher Automation Levels by Deep Learning

Patrik Feth, Mohammed Naveed Akram, René Schuster et al.

Vehicles of higher automation levels require the creation of situation awareness. One important aspect of this situation awareness is an understanding of the current risk of a driving situation. In this work, we present a novel approach for the dynamic risk assessment of driving situations based on images of a front stereo camera using deep learning. To this end, we trained a deep neural network with recorded monocular images, disparity maps and a risk metric for diverse traffic scenes. Our approach can be used to create the aforementioned situation awareness of vehicles of higher automation levels and can serve as a heterogeneous channel to systems based on radar or lidar sensors that are used traditionally for the calculation of risk metrics.

7.3CVMay 9, 2018

FlowFields++: Accurate Optical Flow Correspondences Meet Robust Interpolation

René Schuster, Christian Bailer, Oliver Wasenmüller et al.

Optical Flow algorithms are of high importance for many applications. Recently, the Flow Field algorithm and its modifications have shown remarkable results, as they have been evaluated with top accuracy on different data sets. In our analysis of the algorithm we have found that it produces accurate sparse matches, but there is room for improvement in the interpolation. Thus, we propose in this paper FlowFields++, where we combine the accurate matches of Flow Fields with a robust interpolation. In addition, we propose improved variational optimization as post-processing. Our new algorithm is evaluated on the challenging KITTI and MPI Sintel data sets with public top results on both benchmarks.

10.7CVJan 15, 2018

Combining Stereo Disparity and Optical Flow for Basic Scene Flow

René Schuster, Christian Bailer, Oliver Wasenmüller et al.

Scene flow is a description of real world motion in 3D that contains more information than optical flow. Because of its complexity there exists no applicable variant for real-time scene flow estimation in an automotive or commercial vehicle context that is sufficiently robust and accurate. Therefore, many applications estimate the 2D optical flow instead. In this paper, we examine the combination of top-performing state-of-the-art optical flow and stereo disparity algorithms in order to achieve a basic scene flow. On the public KITTI Scene Flow Benchmark we demonstrate the reasonable accuracy of the combination approach and show its speed in computation.

10.0CVOct 27, 2017

SceneFlowFields: Dense Interpolation of Sparse Scene Flow Correspondences

René Schuster, Oliver Wasenmüller, Georg Kuschk et al.

While most scene flow methods use either variational optimization or a strong rigid motion assumption, we show for the first time that scene flow can also be estimated by dense interpolation of sparse matches. To this end, we find sparse matches across two stereo image pairs that are detected without any prior regularization and perform dense interpolation preserving geometric and motion boundaries by using edge information. A few iterations of variational energy minimization are performed to refine our results, which are thoroughly evaluated on the KITTI benchmark and additionally compared to state-of-the-art on MPI Sintel. For application in an automotive context, we further show that an optional ego-motion model helps to boost performance and blends smoothly into our approach to produce a segmentation of the scene into static and dynamic parts.