Klaus Dietmayer

CV
h-index: 22
90 papers
7,461 citations
Novelty: 46%
AI Score: 48

90 Papers

CV · Jul 1, 2022 · Code
MotionMixer: MLP-based 3D Human Body Pose Forecasting

Arij Bouazizi, Adrian Holzbock, Ulrich Kressel et al.

In this work, we present MotionMixer, an efficient 3D human body pose forecasting model based solely on multi-layer perceptrons (MLPs). MotionMixer learns the spatial-temporal 3D body pose dependencies by sequentially mixing both modalities. Given a stacked sequence of 3D body poses, a spatial MLP extracts fine-grained spatial dependencies of the body joints. The interaction of the body joints over time is then modelled by a temporal MLP. The spatial-temporal mixed features are finally aggregated and decoded to obtain the future motion. To calibrate the influence of each time step in the pose sequence, we make use of squeeze-and-excitation (SE) blocks. We evaluate our approach on the Human3.6M, AMASS, and 3DPW datasets using the standard evaluation protocols. For all evaluations, we demonstrate state-of-the-art performance while using a model with fewer parameters. Our code is available at: https://github.com/MotionMLP/MotionMixer
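The squeeze-and-excitation step mentioned in the abstract, which re-weights each time step of the pose sequence, can be illustrated with a minimal pure-Python sketch. It is not taken from the MotionMixer repository: the function name `se_reweight`, the bottleneck size, and all weight values are assumptions chosen for illustration, and the squeeze is taken as a plain mean over the feature axis.

```python
import math

def se_reweight(features, w1, w2):
    """Re-weight each time step of a (T x D) feature matrix, SE-style.

    features: list of T rows, each with D feature values.
    w1: bottleneck weights, shape (H x T); w2: expansion weights, shape (T x H).
    All names and shapes here are illustrative assumptions.
    """
    # Squeeze: one summary value per time step (mean over the feature axis).
    squeezed = [sum(row) / len(row) for row in features]
    # Excitation, part 1: linear bottleneck over the time axis followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(w_row, squeezed))) for w_row in w1]
    # Excitation, part 2: expand back to T values and squash with a sigmoid.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(w_row, hidden))))
        for w_row in w2
    ]
    # Scale: multiply every feature of time step t by its gate.
    scaled = [[g * x for x in row] for g, row in zip(gates, features)]
    return gates, scaled

# Toy input: T = 3 time steps, D = 2 features; weights are arbitrary.
features = [[1.0, 3.0], [2.0, 2.0], [0.0, 4.0]]
w1 = [[0.5, 0.25, 0.0]]        # maps T = 3 values to a bottleneck of size 1
w2 = [[1.0], [0.0], [-1.0]]    # maps the bottleneck back to T = 3 gates
gates, scaled = se_reweight(features, w1, w2)
```

With these toy weights the first time step is amplified relative to the last, while the middle one receives a neutral gate of 0.5. In a trained model the weights would be learned jointly with the mixing layers.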

CV · Jun 27, 2022 · Code
MGNet: Monocular Geometric Scene Understanding for Autonomous Driving

Markus Schön, Michael Buchholz, Klaus Dietmayer

We introduce MGNet, a multi-task framework for monocular geometric scene understanding. We define monocular geometric scene understanding as the combination of two known tasks: panoptic segmentation and self-supervised monocular depth estimation. Panoptic segmentation captures the full scene not only semantically, but also on an instance basis. Self-supervised monocular depth estimation uses geometric constraints derived from the camera measurement model in order to measure depth from monocular video sequences only. To the best of our knowledge, we are the first to propose the combination of these two tasks in one single model. Our model is designed with a focus on low latency to provide fast inference in real-time on a single consumer-grade GPU. During deployment, our model produces dense 3D point clouds with instance-aware semantic labels from single high-resolution camera images. We evaluate our model on two popular autonomous driving benchmarks, i.e., Cityscapes and KITTI, and show competitive performance among other real-time capable methods. Source code is available at https://github.com/markusschoen/MGNet.

CV · Jun 22, 2023 · Code
Data-Free Backbone Fine-Tuning for Pruned Neural Networks

Adrian Holzbock, Achyut Hegde, Klaus Dietmayer et al.

Model compression techniques reduce the computational load and memory consumption of deep neural networks. After the compression operation, e.g. parameter pruning, the model is normally fine-tuned on the original training dataset to recover from the performance drop caused by compression. However, the training data is not always available due to privacy issues or other factors. In this work, we present a data-free fine-tuning approach for pruning the backbone of deep neural networks. In particular, the pruned network backbone is trained with synthetically generated images, and our proposed intermediate supervision to mimic the unpruned backbone's output feature map. Afterwards, the pruned backbone can be combined with the original network head to make predictions. We generate synthetic images by back-propagating gradients to noise images while relying on L1-pruning for the backbone pruning. In our experiments, we show that our approach is task-independent due to pruning only the backbone. By evaluating our approach on 2D human pose estimation, object detection, and image classification, we demonstrate promising performance compared to the unpruned model. Our code is available at https://github.com/holzbock/dfbf.
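The abstract names L1-pruning as the backbone pruning criterion without giving details. As a rough illustration only, the sketch below uses the common L1-norm filter criterion (keep the filters whose weights have the largest summed absolute value). The function name `l1_prune`, the `keep_ratio` parameter, and the toy weights are assumptions and are not taken from the authors' code.

```python
def l1_prune(filters, keep_ratio):
    """Return the sorted indices of the filters kept after L1-norm pruning.

    filters: list of flattened filter weights (one list of floats per filter).
    keep_ratio: fraction of filters to keep (hypothetical parameter).
    """
    # L1 norm of each filter: sum of absolute weight values.
    norms = [sum(abs(w) for w in f) for f in filters]
    # Always keep at least one filter so the layer stays usable.
    n_keep = max(1, round(keep_ratio * len(filters)))
    # Rank filters from largest to smallest norm and keep the top ones.
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:n_keep])

# Four toy filters with two weights each.
filters = [[0.1, -0.2], [1.5, -0.5], [0.0, 0.05], [-0.9, 0.3]]
kept = l1_prune(filters, keep_ratio=0.5)
```

Here the two filters with the largest L1 norms (indices 1 and 3) survive. The data-free fine-tuning described in the abstract would then train the pruned backbone to mimic the unpruned backbone's output feature map.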

CV · Mar 16, 2023 · Code
Tackling Clutter in Radar Data -- Label Generation and Detection Using PointNet++

Johannes Kopp, Dominik Kellner, Aldi Piroli et al.

Radar sensors employed for environment perception, e.g. in autonomous vehicles, output a lot of unwanted clutter. These points, for which no corresponding real objects exist, are a major source of errors in following processing steps like object detection or tracking. We therefore present two novel neural network setups for identifying clutter. The input data, network architectures and training configuration are adjusted specifically for this task. Special attention is paid to the downsampling of point clouds composed of multiple sensor scans. In an extensive evaluation, the new setups display substantially better performance than existing approaches. Because there is no suitable public data set in which clutter is annotated, we design a method to automatically generate the respective labels. By applying it to existing data with object annotations and releasing its code, we effectively create the first freely available radar clutter data set representing real-world driving scenarios. Code and instructions are accessible at www.github.com/kopp-j/clutter-ds.

CV · Jan 9, 2023
SCENE: Reasoning about Traffic Scenes using Heterogeneous Graph Neural Networks

Thomas Monninger, Julian Schmidt, Jan Rupprecht et al.

Understanding traffic scenes requires considering heterogeneous information about dynamic agents and the static infrastructure. In this work, we propose SCENE, a methodology to encode diverse traffic scenes in heterogeneous graphs and to reason about these graphs using a heterogeneous Graph Neural Network encoder and task-specific decoders. The heterogeneous graphs, whose structures are defined by an ontology, consist of different nodes with type-specific node features and different relations with type-specific edge features. In order to exploit all the information given by these graphs, we propose to use cascaded layers of graph convolution. The result is an encoding of the scene. Task-specific decoders can be applied to predict desired attributes of the scene. Extensive evaluation on two diverse binary node classification tasks shows the main strength of this methodology: despite being generic, it even manages to outperform task-specific baselines. The further application of our methodology to the task of node classification in various knowledge graphs shows its transferability to other domains.

CV · May 31, 2022
Transformers for Multi-Object Tracking on Point Clouds

Felicia Ruppel, Florian Faion, Claudius Gläser et al.

We present TransMOT, a novel transformer-based end-to-end trainable online tracker and detector for point cloud data. The model utilizes a cross- and a self-attention mechanism and is applicable to lidar data in an automotive context, as well as other data types, such as radar. Both track management and the detection of new tracks are performed by the same transformer decoder module and the tracker state is encoded in feature space. With this approach, we make use of the rich latent space of the detector for tracking rather than relying on low-dimensional bounding boxes. Still, we are able to retain some of the desirable properties of traditional Kalman-filter based approaches, such as an ability to handle sensor input at arbitrary timesteps or to compensate frame skips. This is possible due to a novel module that transforms the track information from one frame to the next on feature-level and thereby fulfills a similar task as the prediction step of a Kalman filter. Results are presented on the challenging real-world dataset nuScenes, where the proposed model outperforms its Kalman filter-based tracking baseline.

CV · Apr 25, 2022
A Spatio-Temporal Multilayer Perceptron for Gesture Recognition

Adrian Holzbock, Alexander Tsaregorodtsev, Youssef Dawoud et al.

Gesture recognition is essential for the interaction of autonomous vehicles with humans. While the current approaches focus on combining several modalities like image features, keypoints and bone vectors, we present a neural network architecture that delivers state-of-the-art results with only body skeleton input data. We propose the spatio-temporal multilayer perceptron for gesture recognition in the context of autonomous vehicles. Given 3D body poses over time, we define temporal and spatial mixing operations to extract features in both domains. Additionally, the importance of each time step is re-weighted with Squeeze-and-Excitation layers. An extensive evaluation on the TCG and Drive&Act datasets is provided to showcase the promising performance of our approach. Furthermore, we deploy our model to our autonomous vehicle to show its real-time capability and stable execution.

CV · Feb 13, 2023
Exploring Navigation Maps for Learning-Based Motion Prediction

Julian Schmidt, Julian Jordan, Franz Gritschneder et al.

The prediction of surrounding agents' motion is key to safe autonomous driving. In this paper, we explore navigation maps as an alternative to the predominant High Definition (HD) maps for learning-based motion prediction. Navigation maps provide topological and geometrical information at road level; HD maps additionally have centimeter-accurate lane-level information. As a result, HD maps are costly and time-consuming to obtain, while navigation maps with near-global coverage are freely available. We describe an approach to integrate navigation maps into learning-based motion prediction models. To exploit locally available HD maps during training, we additionally propose a model-agnostic method for knowledge distillation. In experiments on the publicly available Argoverse dataset with navigation maps obtained from OpenStreetMap, our approach shows a significant improvement over not using a map at all. Combined with our method for knowledge distillation, we achieve results that are close to the original HD map-reliant models. Our publicly available navigation map API for Argoverse enables researchers to develop and evaluate their own approaches using navigation maps.

CV · May 24, 2022
Robust 3D Object Detection in Cold Weather Conditions

Aldi Piroli, Vinzenz Dallabetta, Marc Walessa et al.

Adverse weather conditions can negatively affect LiDAR-based object detectors. In this work, we focus on the phenomenon of vehicle gas exhaust condensation in cold weather conditions. This everyday effect can distort the estimated sizes and orientations of objects and introduce ghost object detections, compromising the reliability of state-of-the-art object detectors. We propose to solve this problem by using data augmentation and a novel training loss term. To effectively train deep neural networks, a large set of labeled data is needed. In the case of adverse weather conditions, this process can be extremely laborious and expensive. We address this issue in two steps: First, we present a gas exhaust data generation method based on 3D surface reconstruction and sampling which allows us to generate large sets of gas exhaust clouds from a small pool of labeled data. Second, we introduce a point cloud augmentation process that can be used to add gas exhaust to datasets recorded in good weather conditions. Finally, we formulate a new training loss term that leverages the augmented point cloud to increase object detection robustness by penalizing predictions that include noise. In contrast to other works, our method can be used with both grid-based and point-based detectors. Moreover, since our approach does not require any network architecture changes, inference times remain unchanged. Experimental results on real data show that our proposed method greatly increases robustness to gas exhaust and noisy data.

CV · Jul 11, 2022
Detection of Condensed Vehicle Gas Exhaust in LiDAR Point Clouds

Aldi Piroli, Vinzenz Dallabetta, Marc Walessa et al.

LiDAR sensors used in autonomous driving applications are negatively affected by adverse weather conditions. One common but understudied effect is the condensation of vehicle gas exhaust in cold weather. This everyday phenomenon can severely impact the quality of LiDAR measurements, resulting in a less accurate environment perception by creating artifacts like ghost object detections. In the literature, the semantic segmentation of adverse weather effects like rain and fog is achieved using learning-based approaches. However, such methods require large sets of labeled data, which can be extremely expensive and laborious to obtain. We address this problem by presenting a two-step approach for the detection of condensed vehicle gas exhaust. First, we identify for each vehicle in a scene its emission area and detect gas exhaust if present. Then, isolated clouds are detected by modeling through time the regions of space where gas exhaust is likely to be present. We test our method on real urban data, showing that our approach can reliably detect gas exhaust in different scenarios, making it appealing for offline pre-labeling and online applications such as ghost object detection.

CV · Apr 12, 2023
RESET: Revisiting Trajectory Sets for Conditional Behavior Prediction

Julian Schmidt, Pascal Huissel, Julian Wiederer et al.

It is desirable to predict the behavior of traffic participants conditioned on different planned trajectories of the autonomous vehicle. This allows the downstream planner to estimate the impact of its decisions. Recent approaches for conditional behavior prediction rely on a regression decoder, meaning that coordinates or polynomial coefficients are regressed. In this work, we revisit set-based trajectory prediction, where the probability of each trajectory in a predefined trajectory set is determined by a classification model, and apply it for the first time to the task of conditional behavior prediction. We propose RESET, which combines a new metric-driven algorithm for trajectory set generation with a graph-based encoder. For unconditional prediction, RESET achieves comparable performance to a regression-based approach. Due to the nature of set-based approaches, it has the advantageous property of being able to predict a flexible number of trajectories without influencing runtime or complexity. For conditional prediction, RESET achieves reasonable results with late fusion of the planned trajectory, which was not observed for regression-based approaches before. This means that RESET is computationally lightweight to combine with a planner that proposes multiple future plans of the autonomous vehicle, as large parts of the forward pass can be reused.
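The set-based formulation described above can be sketched in a few lines: a classifier assigns one score to every trajectory in a fixed set, and any number of predictions can be read off by taking the top-scoring entries. The sketch below is a generic illustration, not RESET itself. The function name `top_k_trajectories`, the toy trajectory set, and the logits are assumptions; the metric-driven set generation and the graph-based encoder are not modeled.

```python
import math

def top_k_trajectories(logits, trajectory_set, k):
    """Return the k most probable trajectories with their probabilities."""
    # Numerically stable softmax over the classification scores.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank set entries by probability; changing k needs no new forward pass.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(trajectory_set[i], probs[i]) for i in order[:k]]

# Hypothetical predefined set: each trajectory is a short list of (x, y) points.
trajectory_set = [
    [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)],   # straight
    [(0.0, 0.0), (4.0, 1.0), (7.0, 4.0)],    # left turn
    [(0.0, 0.0), (4.0, -1.0), (7.0, -4.0)],  # right turn
    [(0.0, 0.0), (1.0, 0.0), (1.5, 0.0)],    # slow down
]
logits = [2.0, 0.5, -1.0, 1.0]  # stand-in for classifier outputs
best_two = top_k_trajectories(logits, trajectory_set, k=2)
```

Because the probabilities cover the whole set, requesting two or four predictions only changes the slice taken from the ranking, which is the flexibility the abstract attributes to set-based approaches.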

CV · Jun 10, 2022
MEAT: Maneuver Extraction from Agent Trajectories

Julian Schmidt, Julian Jordan, David Raba et al.

Advances in learning-based trajectory prediction are enabled by large-scale datasets. However, in-depth analysis of such datasets is limited. Moreover, the evaluation of prediction models is limited to metrics averaged over all samples in the dataset. We propose an automated methodology for extracting maneuvers (e.g., left turn, lane change) from agent trajectories in such datasets. The methodology considers information about the agent dynamics and information about the lane segments the agent traveled along. Although it is possible to use the resulting maneuvers for training classification networks, we use them here, by way of example, for extensive trajectory dataset analysis and maneuver-specific evaluation of multiple state-of-the-art trajectory prediction models. Additionally, an analysis of the datasets and an evaluation of the prediction models based on the agent dynamics is provided.

CV · Oct 16, 2023
Multimodal Object Query Initialization for 3D Object Detection

Mathijs R. van Geerenstein, Felicia Ruppel, Klaus Dietmayer et al.

3D object detection models that exploit both LiDAR and camera sensor features are top performers in large-scale autonomous driving benchmarks. A transformer is a popular network architecture used for this task, in which so-called object queries act as candidate objects. Initializing these object queries based on current sensor inputs is a common practice. For this, however, existing methods rely strongly on LiDAR data and do not fully exploit image features. Besides, they introduce significant latency. To overcome these limitations, we propose EfficientQ3M, an efficient, modular, and multimodal solution for object query initialization for transformer-based 3D object detection models. The proposed initialization method is combined with a "modality-balanced" transformer decoder where the queries can access all sensor modalities throughout the decoder. In experiments, we outperform the state of the art in transformer-based LiDAR object detection on the competitive nuScenes benchmark and showcase the benefits of input-dependent multimodal query initialization, while being more efficient than the available alternatives for LiDAR-camera initialization. The proposed method can be applied with any combination of sensor modalities as input, demonstrating its modularity.

LG · Aug 28, 2023
Group Regression for Query Based Object Detection and Tracking

Felicia Ruppel, Florian Faion, Claudius Gläser et al.

Group regression is commonly used in 3D object detection to predict box parameters of similar classes in a joint head, aiming to benefit from similarities while separating highly dissimilar classes. For query-based perception methods, this has, so far, not been feasible. We close this gap and present a method to incorporate multi-class group regression, especially designed for the 3D domain in the context of autonomous driving, into existing attention and query-based perception approaches. We enhance a transformer based joint object detection and tracking model with this approach, and thoroughly evaluate its behavior and performance. For group regression, the classes of the nuScenes dataset are divided into six groups of similar shape and prevalence, each being regressed by a dedicated head. We show that the proposed method is applicable to many existing transformer based perception approaches and can bring potential benefits. The behavior of query group regression is thoroughly analyzed in comparison to a unified regression head, e.g. in terms of class-switching behavior and distribution of the output parameters. The proposed method offers many possibilities for further research, such as in the direction of deep multi-hypotheses tracking.

CV · Feb 20, 2023
Gesture Recognition with Keypoint and Radar Stream Fusion for Automated Vehicles

Adrian Holzbock, Nicolai Kern, Christian Waldschmidt et al.

We present a joint camera and radar approach to enable autonomous vehicles to understand and react to human gestures in everyday traffic. Initially, we process the radar data with a PointNet followed by a spatio-temporal multilayer perceptron (stMLP). Independently, the human body pose is extracted from the camera frame and processed with a separate stMLP network. We propose a fusion neural network for both modalities, including an auxiliary loss for each modality. In our experiments with a collected dataset, we show the advantages of gesture recognition with two modalities. Motivated by adverse weather conditions, we also demonstrate promising performance when one of the sensors lacks functionality.

CV · Oct 26, 2022
Can Transformer Attention Spread Give Insights Into Uncertainty of Detected and Tracked Objects?

Felicia Ruppel, Florian Faion, Claudius Gläser et al.

Transformers have recently been utilized to perform object detection and tracking in the context of autonomous driving. One unique characteristic of these models is that attention weights are computed in each forward pass, giving insights into the model's interior, in particular, which part of the input data it deemed interesting for the given task. Such an attention matrix with the input grid is available for each detected (or tracked) object in every transformer decoder layer. In this work, we investigate the distribution of these attention weights: How do they change through the decoder layers and through the lifetime of a track? Can they be used to infer additional information about an object, such as a detection uncertainty? Especially in unstructured environments, or environments that were not common during training, a reliable measure of detection uncertainty is crucial to decide whether the system can still be trusted or not.

CV · Sep 30, 2022
Transformers for Object Detection in Large Point Clouds

Felicia Ruppel, Florian Faion, Claudius Gläser et al.

We present TransLPC, a novel detection model for large point clouds that is based on a transformer architecture. While object detection with transformers has been an active field of research, it has proved difficult to apply such models to point clouds that span a large area, e.g. those that are common in autonomous driving, with lidar or radar data. TransLPC is able to remedy these issues: The structure of the transformer model is modified to allow for larger input sequence lengths, which are sufficient for large point clouds. Besides this, we propose a novel query refinement technique to improve detection accuracy, while retaining a memory-friendly number of transformer decoder queries. The queries are repositioned between layers, moving them closer to the bounding box they are estimating, in an efficient manner. This simple technique has a significant effect on detection accuracy, which is evaluated on the challenging nuScenes dataset on real-world lidar data. Besides this, the proposed method is compatible with existing transformer-based solutions that require object detection, e.g. for joint multi-object tracking and detection, and enables them to be used in conjunction with large point clouds.

CV · Oct 2, 2023
Towards Robust 3D Object Detection In Rainy Conditions

Aldi Piroli, Vinzenz Dallabetta, Johannes Kopp et al.

LiDAR sensors are used in autonomous driving applications to accurately perceive the environment. However, they are affected by adverse weather conditions such as snow, fog, and rain. These everyday phenomena introduce unwanted noise into the measurements, severely degrading the performance of LiDAR-based perception systems. In this work, we propose a framework for improving the robustness of LiDAR-based 3D object detectors against road spray. Our approach uses a state-of-the-art adverse weather detection network to filter out spray from the LiDAR point cloud, which is then used as input for the object detector. In this way, the detected objects are less affected by the adverse weather in the scene, resulting in a more accurate perception of the environment. In addition to adverse weather filtering, we explore the use of radar targets to further filter false positive detections. Tests on real-world data show that our approach improves the robustness to road spray of several popular 3D object detectors.

CV · Apr 12, 2023
LMR: Lane Distance-Based Metric for Trajectory Prediction

Julian Schmidt, Thomas Monninger, Julian Jordan et al.

The development of approaches for trajectory prediction requires metrics to validate and compare their performance. Currently established metrics are based on Euclidean distance, which means that errors are weighted equally in all directions. Euclidean metrics are insufficient for structured environments like roads, since they do not properly capture the agent's intent relative to the underlying lane. In order to provide a reasonable assessment of trajectory prediction approaches with regard to the downstream planning task, we propose a new metric that is lane distance-based: Lane Miss Rate (LMR). For the calculation of LMR, the ground-truth and predicted endpoints are assigned to lane segments, more precisely their centerlines. Measured by the distance along the lane segments, predictions that are within a certain threshold distance to the ground-truth count as hits; otherwise, they count as misses. LMR is then defined as the ratio of sequences that yield a miss. Our results on three state-of-the-art trajectory prediction models show that LMR preserves the order of Euclidean distance-based metrics. In contrast to the Euclidean Miss Rate, qualitative results show that LMR yields misses for sequences where predictions are located on wrong lanes. Hits, on the other hand, result for sequences where predictions are located on the correct lane. This means that LMR implicitly weights Euclidean error relative to the lane and moves toward capturing the intents of traffic agents. The source code of LMR for Argoverse 2 is publicly available.
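The final step of the metric, turning per-sequence lane distances into a miss rate, can be sketched as follows. This is a simplified illustration, not the published LMR code. It assumes the along-lane distance between predicted and ground-truth endpoints has already been computed for each sequence, it uses `float("inf")` as a stand-in for predictions on a lane with no connection to the ground-truth lane, and the 2.0 threshold is an arbitrary example value.

```python
def lane_miss_rate(lane_distances, threshold):
    """Fraction of sequences whose along-lane endpoint distance exceeds the threshold.

    lane_distances: one distance per sequence, measured along the lane
    centerlines (assumed precomputed; the lane assignment is not modeled here).
    """
    # A sequence is a miss when its prediction lies farther than the threshold
    # from the ground truth; distances at or below the threshold are hits.
    misses = sum(1 for d in lane_distances if d > threshold)
    return misses / len(lane_distances)

# Five toy sequences; inf marks a prediction on an unreachable lane (assumption).
distances = [0.5, 1.8, 3.2, float("inf"), 0.0]
lmr = lane_miss_rate(distances, threshold=2.0)
```

In this toy example two of the five sequences are misses, so the miss rate is 0.4. A prediction that is close in Euclidean terms but sits on a neighboring lane would still be counted as a miss, provided its along-lane distance is large or undefined.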

CV · Oct 2, 2023
LS-VOS: Identifying Outliers in 3D Object Detections Using Latent Space Virtual Outlier Synthesis

Aldi Piroli, Vinzenz Dallabetta, Johannes Kopp et al.

LiDAR-based 3D object detectors have achieved unprecedented speed and accuracy in autonomous driving applications. However, similar to other neural networks, they are often biased toward high-confidence predictions or return detections where no real object is present. These types of detections can lead to a less reliable environment perception, severely affecting the functionality and safety of autonomous vehicles. We address this problem by proposing LS-VOS, a framework for identifying outliers in 3D object detections. Our approach builds on the idea of Virtual Outlier Synthesis (VOS), which incorporates outlier knowledge during training, enabling the model to learn more compact decision boundaries. In particular, we propose a new synthesis approach that relies on the latent space of an auto-encoder network to generate outlier features with a parametrizable degree of similarity to in-distribution features. In extensive experiments, we show that our approach improves the outlier detection capabilities of a state-of-the-art object detector while maintaining high 3D object detection performance.

CV · Apr 24, 2024 · Code
Revisiting Out-of-Distribution Detection in LiDAR-based 3D Object Detection

Michael Kösel, Marcel Schreiber, Michael Ulrich et al.

LiDAR-based 3D object detection has become an essential part of automated driving due to its ability to localize and classify objects precisely in 3D. However, object detectors face a critical challenge when dealing with unknown foreground objects, particularly those that were not present in their original training data. These out-of-distribution (OOD) objects can lead to misclassifications, posing a significant risk to the safety and reliability of automated vehicles. Currently, LiDAR-based OOD object detection has not been well studied. We address this problem by generating synthetic training data for OOD objects by perturbing known object categories. Our idea is that these synthetic OOD objects produce different responses in the feature map of an object detector compared to in-distribution (ID) objects. We then extract features using a pre-trained and fixed object detector and train a simple multilayer perceptron (MLP) to classify each detection as either ID or OOD. In addition, we propose a new evaluation protocol that allows the use of existing datasets without modifying the point cloud, ensuring a more authentic evaluation of real-world scenarios. The effectiveness of our method is validated through experiments on the newly proposed nuScenes OOD benchmark. The source code is available at https://github.com/uulm-mrm/mmood3d.

CV · Nov 18, 2024 · Code
The ADUULM-360 Dataset -- A Multi-Modal Dataset for Depth Estimation in Adverse Weather

Markus Schön, Jona Ruof, Thomas Wodtko et al.

Depth estimation is an essential task toward full scene understanding since it allows the projection of rich semantic information captured by cameras into 3D space. While the field has gained much attention recently, datasets for depth estimation lack scene diversity or sensor modalities. This work presents the ADUULM-360 dataset, a novel multi-modal dataset for depth estimation. The ADUULM-360 dataset covers all established autonomous driving sensor modalities: cameras, lidars, and radars. It includes a frontal-facing stereo setup, six surround cameras covering the full 360 degrees, two high-resolution long-range lidar sensors, and five long-range radar sensors. It is also the first depth estimation dataset that contains diverse scenes in good and adverse weather conditions. We conduct extensive experiments using state-of-the-art self-supervised depth estimation methods under different training tasks, such as monocular training, stereo training, and full surround training. Discussing these results, we demonstrate common limitations of state-of-the-art methods, especially in adverse weather conditions, which hopefully will inspire future research in this area. Our dataset, development kit, and trained baselines are available at https://github.com/uulm-mrm/aduulm_360_dataset.

CV · Nov 13, 2023
Simultaneous Clutter Detection and Semantic Segmentation of Moving Objects for Automotive Radar Data

Johannes Kopp, Dominik Kellner, Aldi Piroli et al.

The unique properties of radar sensors, such as their robustness to adverse weather conditions, make them an important part of the environment perception system of autonomous vehicles. One of the first steps during the processing of radar point clouds is often the detection of clutter, i.e. erroneous points that do not correspond to real objects. Another common objective is the semantic segmentation of moving road users. These two problems are handled strictly separately from each other in the literature. The employed neural networks are always focused entirely on only one of the tasks. In contrast to this, we examine ways to solve both tasks at the same time with a single jointly used model. In addition to a new augmented multi-head architecture, we also devise a method to represent a network's predictions for the two tasks with only one output value. This novel approach allows us to solve the tasks simultaneously with the same inference time as a conventional task-specific model. In an extensive evaluation, we show that our setup is highly effective and outperforms every existing network for semantic segmentation on the RadarScenes dataset.

CV · Mar 9 · Code
ALOOD: Exploiting Language Representations for LiDAR-based Out-of-Distribution Object Detection

Michael Kösel, Marcel Schreiber, Michael Ulrich et al.

LiDAR-based 3D object detection plays a critical role for reliable and safe autonomous driving systems. However, existing detectors often produce overly confident predictions for objects not belonging to known categories, posing significant safety risks. This is caused by so-called out-of-distribution (OOD) objects, which were not part of the training data, resulting in incorrect predictions. To address this challenge, we propose ALOOD (Aligned LiDAR representations for Out-Of-Distribution Detection), a novel approach that incorporates language representations from a vision-language model (VLM). By aligning the object features from the object detector to the feature space of the VLM, we can treat the detection of OOD objects as a zero-shot classification task. We demonstrate competitive performance on the nuScenes OOD benchmark, establishing a novel approach to OOD object detection in LiDAR using language representations. The source code is available at https://github.com/uulm-mrm/mmood3d.

CV · Nov 18, 2024 · Code
MGNiceNet: Unified Monocular Geometric Scene Understanding

Markus Schön, Michael Buchholz, Klaus Dietmayer

Monocular geometric scene understanding combines panoptic segmentation and self-supervised depth estimation, focusing on real-time application in autonomous vehicles. We introduce MGNiceNet, a unified approach that uses a linked kernel formulation for panoptic segmentation and self-supervised depth estimation. MGNiceNet is based on the state-of-the-art real-time panoptic segmentation method RT-K-Net and extends the architecture to cover both panoptic segmentation and self-supervised monocular depth estimation. To this end, we introduce a tightly coupled self-supervised depth estimation predictor that explicitly uses information from the panoptic path for depth prediction. Furthermore, we introduce a panoptic-guided motion masking method to improve depth estimation without relying on video panoptic segmentation annotations. We evaluate our method on two popular autonomous driving datasets, Cityscapes and KITTI. Our model shows state-of-the-art results compared to other real-time methods and closes the gap to computationally more demanding methods. Source code and trained models are available at https://github.com/markusschoen/MGNiceNet.

CVMay 2, 2023Code
RT-K-Net: Revisiting K-Net for Real-Time Panoptic Segmentation

Markus Schön, Michael Buchholz, Klaus Dietmayer

Panoptic segmentation is one of the most challenging scene parsing tasks, combining the tasks of semantic segmentation and instance segmentation. While much progress has been made, few works focus on the real-time application of panoptic segmentation methods. In this paper, we revisit the recently introduced K-Net architecture. We propose vital changes to the architecture, training, and inference procedure, which massively decrease latency and improve performance. Our resulting RT-K-Net sets a new state-of-the-art performance for real-time panoptic segmentation methods on the Cityscapes dataset and shows promising results on the challenging Mapillary Vistas dataset. On Cityscapes, RT-K-Net reaches 60.2 % PQ with an average inference time of 32 ms for full resolution 1024x2048 pixel images on a single Titan RTX GPU. On Mapillary Vistas, RT-K-Net reaches 33.2 % PQ with an average inference time of 69 ms. Source code is available at https://github.com/markusschoen/RT-K-Net.
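For readers unfamiliar with the PQ figures quoted above, the following sketch computes panoptic quality for a single class from already-matched segments. It assumes the usual convention that a predicted and a ground-truth segment are matched when their IoU exceeds 0.5; the function name and inputs are illustrative, and the benchmark score additionally averages this quantity over classes.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Panoptic quality for one class.

    matched_ious: IoU values of matched (true positive) segment pairs.
    num_fp:       number of unmatched predicted segments.
    num_fn:       number of unmatched ground-truth segments.
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0  # class absent from both prediction and ground truth
    return sum(matched_ious) / denom
```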

CVNov 20, 2020Code
A Review and Comparative Study on Probabilistic Object Detection in Autonomous Driving

Di Feng, Ali Harakeh, Steven Waslander et al.

Capturing uncertainty in object detection is indispensable for safe autonomous driving. In recent years, deep learning has become the de-facto approach for object detection, and many probabilistic object detectors have been proposed. However, there is no summary on uncertainty estimation in deep object detection, and existing methods are not only built with different network architectures and uncertainty estimation methods, but also evaluated on different datasets with a wide range of evaluation metrics. As a result, a comparison among methods remains challenging, as does the selection of a model that best suits a particular application. This paper aims to alleviate this problem by providing a review and comparative study on existing probabilistic object detection methods for autonomous driving applications. First, we provide an overview of generic uncertainty estimation in deep learning, and then systematically survey existing methods and evaluation metrics for probabilistic object detection. Next, we present a strict comparative study for probabilistic object detection based on an image detector and three public autonomous driving datasets. Finally, we present a discussion of the remaining challenges and future works. Code has been made available at https://github.com/asharakeh/pod_compare.git

CVNov 2, 2020Code
Point Transformer

Nico Engel, Vasileios Belagiannis, Klaus Dietmayer

In this work, we present Point Transformer, a deep neural network that operates directly on unordered and unstructured point sets. We design Point Transformer to extract local and global features and relate both representations by introducing the local-global attention mechanism, which aims to capture spatial point relations and shape information. For that purpose, we propose SortNet, as part of the Point Transformer, which induces input permutation invariance by selecting points based on a learned score. The output of Point Transformer is a sorted and permutation invariant feature list that can directly be incorporated into common computer vision applications. We evaluate our approach on standard classification and part segmentation benchmarks to demonstrate competitive results compared to the prior work. Code is publicly available at: https://github.com/engelnico/point-transformer

CVJun 21, 2019Code
Pixel-Accurate Depth Evaluation in Realistic Driving Scenarios

Tobias Gruber, Mario Bijelic, Felix Heide et al.

This work introduces an evaluation benchmark for depth estimation and completion using high-resolution depth measurements with angular resolution of up to 25" (arcsecond), akin to a 50 megapixel camera with per-pixel depth available. Existing datasets, such as the KITTI benchmark, provide only sparse reference measurements with an order of magnitude lower angular resolution - these sparse measurements are treated as ground truth by existing depth estimation methods. We propose an evaluation methodology in four characteristic automotive scenarios recorded in varying weather conditions (day, night, fog, rain). As a result, our benchmark allows us to evaluate the robustness of depth sensing methods in adverse weather and different driving conditions. Using the proposed evaluation data, we demonstrate that current stereo approaches provide significantly more stable depth estimates than monocular methods and lidar completion in adverse weather. Data and code are available at https://github.com/gruberto/PixelAccurateDepthBenchmark.git.

CVFeb 24, 2019Code
Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather

Mario Bijelic, Tobias Gruber, Fahim Mannan et al.

The fusion of multimodal sensor streams, such as camera, lidar, and radar measurements, plays a critical role in object detection for autonomous vehicles, which base their decision making on these inputs. While existing methods exploit redundant information in good environmental conditions, they fail in adverse weather where the sensory streams can be asymmetrically distorted. These rare "edge-case" scenarios are not represented in available datasets, and existing fusion architectures are not designed to handle them. To address this challenge we present a novel multimodal dataset acquired in over 10,000km of driving in northern Europe. Although this dataset is the first large multimodal dataset in adverse weather, with 100k labels for lidar, camera, radar, and gated NIR sensors, it does not facilitate training as extreme weather is rare. To this end, we present a deep fusion network for robust fusion without a large corpus of labeled training data covering all asymmetric distortions. Departing from proposal-level fusion, we propose a single-shot model that adaptively fuses features, driven by measurement entropy. We validate the proposed method, trained on clean data, on our extensive validation dataset. Code and data are available here https://github.com/princeton-computational-imaging/SeeingThroughFog.

CVFeb 13, 2019Code
Gated2Depth: Real-time Dense Lidar from Gated Images

Tobias Gruber, Frank Julca-Aguilar, Mario Bijelic et al.

We present an imaging framework which converts three images from a gated camera into high-resolution depth maps with depth accuracy comparable to pulsed lidar measurements. Existing scanning lidar systems achieve low spatial resolution at large ranges due to mechanically-limited angular sampling rates, restricting scene understanding tasks to close-range clusters with dense sampling. Moreover, today's pulsed lidar scanners suffer from high cost, power consumption, large form-factors, and they fail in the presence of strong backscatter. We depart from point scanning and demonstrate that it is possible to turn a low-cost CMOS gated imager into a dense depth camera with at least 80m range - by learning depth from three gated images. The proposed architecture exploits semantic context across gated slices, and is trained on a synthetic discriminator loss without the need of dense depth labels. The proposed replacement for scanning lidar systems is real-time, handles back-scatter and provides dense depth at long ranges. We validate our approach in simulation and on real-world data acquired over 4,000km driving in northern Europe. Data and code are available at https://github.com/gruberto/Gated2Depth.

CVMay 17, 2018Code
Disparity Sliding Window: Object Proposals From Disparity Images

Julian Müller, Andreas Fregin, Klaus Dietmayer

Sliding window approaches have been widely used for object recognition tasks in recent years. They guarantee an investigation of the entire input image for the object to be detected and allow a localization of that object. Despite the current trend towards deep neural networks, sliding window methods are still used in combination with convolutional neural networks. The risk of overlooking an object is clearly reduced compared to alternative detection approaches which detect objects based on shape, edges or color. Nevertheless, the sliding window technique strongly increases the computational effort as the classifier has to verify a large number of object candidates. This paper proposes a sliding window approach which also uses depth information from a stereo camera. This leads to a greatly decreased number of object candidates without significantly reducing the detection accuracy. A theoretical investigation of the conventional sliding window approach is presented first. Other publications to date only mentioned rough estimations of the computational cost. A mathematical derivation clarifies the number of object candidates with respect to parameters such as image and object size. Subsequently, the proposed disparity sliding window approach is presented in detail. The approach is evaluated on pedestrian detection with annotations and images from the KITTI object detection benchmark. Furthermore, a comparison with two state-of-the-art methods is made. Code is available in C++ and Python at https://github.com/julimueller/disparity-sliding-window.
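Why depth information reduces the number of candidates can be sketched with two small helpers. The first counts window positions for one window size at a fixed stride. The second uses the standard rectified-stereo relations Z = f·B/d and h_px = f·H/Z to predict the pixel height of an object from its disparity. Both functions, their names, and their parameters are illustrative assumptions and do not reproduce the paper's derivation.

```python
def sliding_window_candidates(img_w, img_h, win_w, win_h, stride):
    """Number of window positions for one window size at a fixed stride."""
    if win_w > img_w or win_h > img_h:
        return 0  # window does not fit into the image
    nx = (img_w - win_w) // stride + 1
    ny = (img_h - win_h) // stride + 1
    return nx * ny


def window_height_from_disparity(disparity_px, object_height_m, baseline_m):
    """Expected object height in pixels for a rectified stereo pair.

    Combining Z = f * B / d with h_px = f * H / Z gives h_px = H * d / B,
    so the focal length cancels out.
    """
    return object_height_m * disparity_px / baseline_m
```

With a disparity value available at each pixel, only the window size consistent with the expected object height needs to be evaluated there, instead of every scale of a conventional multi-scale search.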

CVMay 7, 2018Code
Detecting Traffic Lights by Single Shot Detection

Julian Müller, Klaus Dietmayer

Recent improvements in object detection are driven by the success of convolutional neural networks (CNN). They are able to learn rich features outperforming hand-crafted features. So far, research in traffic light detection mainly focused on hand-crafted features, such as color, shape or brightness of the traffic light bulb. This paper presents a deep learning approach for accurate traffic light detection by adapting a single shot detection (SSD) approach. SSD performs object proposal creation and classification using a single CNN. The original SSD struggles to detect very small objects, which is essential for traffic light detection. With our adaptations, it is possible to detect objects much smaller than ten pixels without increasing the input image size. We present an extensive evaluation on the DriveU Traffic Light Dataset (DTLD). We reach both high accuracy and low false positive rates. The trained model is real-time capable with ten frames per second on an Nvidia Titan Xp. Code has been made available at https://github.com/julimueller/tl_ssd.

CVApr 1, 2024
The Radar Ghost Dataset -- An Evaluation of Ghost Objects in Automotive Radar Data

Florian Kraus, Nicolas Scheiner, Werner Ritter et al.

Radar sensors have a long tradition in advanced driver assistance systems (ADAS) and also play a major role in current concepts for autonomous vehicles. Their importance stems from their high robustness against meteorological effects, such as rain, snow, or fog, and the radar's ability to measure relative radial velocity differences via the Doppler effect. The cause of these advantages, namely the large wavelength, is also one of the drawbacks of radar sensors. Compared to camera or lidar sensors, many more surfaces in a typical traffic scenario appear flat relative to the radar's emitted signal. This results in multi-path reflections, or so-called ghost detections, in the radar signal. Ghost objects are a major source of potential false positive detections in a vehicle's perception pipeline. Therefore, it is important to be able to separate multi-path reflections from direct ones. In this article, we present a dataset with detailed manual annotations for different kinds of ghost detections. Moreover, two different approaches for identifying these kinds of objects are evaluated. We hope that our dataset encourages more researchers to engage in the fields of multi-path object suppression or exploitation.

CYJul 24, 2025
A Concept for Efficient Scalability of Automated Driving Allowing for Technical, Legal, Cultural, and Ethical Differences

Lars Ullrich, Michael Buchholz, Jonathan Petit et al.

Efficient scalability of automated driving (AD) is key to reducing costs, enhancing safety, conserving resources, and maximizing impact. However, research focuses on specific vehicles and context, while broad deployment requires scalability across various configurations and environments. Differences in vehicle types, sensors, actuators, but also traffic regulations, legal requirements, cultural dynamics, or even ethical paradigms demand high flexibility of data-driven developed capabilities. In this paper, we address the challenge of scalable adaptation of generic capabilities to desired systems and environments. Our concept follows a two-stage fine-tuning process. In the first stage, fine-tuning to the specific environment takes place through a country-specific reward model that serves as an interface between technological adaptations and socio-political requirements. In the second stage, vehicle-specific transfer learning facilitates system adaptation and governs the validation of design decisions. In sum, our concept offers a data-driven process that integrates both technological and socio-political aspects, enabling effective scalability across technical, legal, cultural, and ethical differences.

CVJun 14, 2024
SemanticSpray++: A Multimodal Dataset for Autonomous Driving in Wet Surface Conditions

Aldi Piroli, Vinzenz Dallabetta, Johannes Kopp et al.

Autonomous vehicles rely on camera, LiDAR, and radar sensors to navigate the environment. Adverse weather conditions like snow, rain, and fog are known to be problematic for both camera and LiDAR-based perception systems. Currently, it is difficult to evaluate the performance of these methods due to the lack of publicly available datasets containing multimodal labeled data. To address this limitation, we propose the SemanticSpray++ dataset, which provides labels for camera, LiDAR, and radar data of highway-like scenarios in wet surface conditions. In particular, we provide 2D bounding boxes for the camera image, 3D bounding boxes for the LiDAR point cloud, and semantic labels for the radar targets. By labeling all three sensor modalities, the SemanticSpray++ dataset offers a comprehensive test bed for analyzing the performance of different perception methods when vehicles travel on wet surface conditions. Together with comprehensive label statistics, we also evaluate multiple baseline methods across different tasks and analyze their performances. The dataset will be available at https://semantic-spray-dataset.github.io .

CVJun 14, 2024
Label-Efficient Semantic Segmentation of LiDAR Point Clouds in Adverse Weather Conditions

Aldi Piroli, Vinzenz Dallabetta, Johannes Kopp et al.

Adverse weather conditions can severely affect the performance of LiDAR sensors by introducing unwanted noise in the measurements. Therefore, differentiating between noise and valid points is crucial for the reliable use of these sensors. Current approaches for detecting adverse weather points require large amounts of labeled data, which can be difficult and expensive to obtain. This paper proposes a label-efficient approach to segment LiDAR point clouds in adverse weather. We develop a framework that uses few-shot semantic segmentation to learn to segment adverse weather points from only a few labeled examples. Then, we use a semi-supervised learning approach to generate pseudo-labels for unlabelled point clouds, significantly increasing the amount of training data without requiring any additional labeling. We also integrate good weather data in our training pipeline, allowing for high performance in both good and adverse weather conditions. Results on real and synthetic datasets show that our method performs well in detecting snow, fog, and spray. Furthermore, we achieve competitive performance against fully supervised methods while using only a fraction of labeled data.

CVMay 25, 2023
Energy-based Detection of Adverse Weather Effects in LiDAR Data

Aldi Piroli, Vinzenz Dallabetta, Johannes Kopp et al.

Autonomous vehicles rely on LiDAR sensors to perceive the environment. Adverse weather conditions like rain, snow, and fog negatively affect these sensors, reducing their reliability by introducing unwanted noise in the measurements. In this work, we tackle this problem by proposing a novel approach for detecting adverse weather effects in LiDAR data. We reformulate this problem as an outlier detection task and use an energy-based framework to detect outliers in point clouds. More specifically, our method learns to associate low energy scores with inlier points and high energy scores with outliers allowing for robust detection of adverse weather effects. In extensive experiments, we show that our method performs better in adverse weather detection and has higher robustness to unseen weather effects than previous state-of-the-art methods. Furthermore, we show how our method can be used to perform simultaneous outlier detection and semantic segmentation. Finally, to help expand the research field of LiDAR perception in adverse weather, we release the SemanticSpray dataset, which contains labeled vehicle spray data in highway-like scenarios. The dataset is available at https://semantic-spray-dataset.github.io .
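The idea of assigning low energy to inliers and high energy to outliers can be sketched with the common logit-based energy score, E(x) = -T · logsumexp(logits / T). This is an assumption borrowed from the general energy-based outlier detection literature: the paper's exact energy formulation and training losses may differ, and the function names and `threshold` parameter below are hypothetical.

```python
import numpy as np

def energy_score(logits, temperature=1.0):
    """Per-point energy E(x) = -T * logsumexp(logits / T).

    Confident inlier predictions (one large logit) give low energy;
    flat, uncertain logits give higher energy.
    """
    z = np.asarray(logits, dtype=float) / temperature
    m = z.max(axis=-1, keepdims=True)          # subtract max for numerical stability
    lse = m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1))
    return -temperature * lse


def flag_outliers(logits, threshold):
    """Mark points whose energy exceeds a (hypothetical) threshold as outliers."""
    return energy_score(logits) > threshold
```

In practice the threshold would be chosen on validation data, for example to reach a target true positive rate on inlier points.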

CVFeb 9, 2022
CRAT-Pred: Vehicle Trajectory Prediction with Crystal Graph Convolutional Neural Networks and Multi-Head Self-Attention

Julian Schmidt, Julian Jordan, Franz Gritschneder et al.

Predicting the motion of surrounding vehicles is essential for autonomous vehicles, as it governs their own motion plan. Current state-of-the-art vehicle prediction models heavily rely on map information. In reality, however, this information is not always available. We therefore propose CRAT-Pred, a multi-modal and non-rasterization-based trajectory prediction model, specifically designed to effectively model social interactions between vehicles, without relying on map information. CRAT-Pred applies a graph convolution method originating from the field of materials science to vehicle prediction, which allows edge features to be leveraged efficiently, and combines it with multi-head self-attention. Compared to other map-free approaches, the model achieves state-of-the-art performance with a significantly lower number of model parameters. In addition, we quantitatively show that the self-attention mechanism is able to learn social interactions between vehicles, with the weights representing a measurable interaction score. The source code is publicly available.

ROFeb 9, 2022
A Multi-Task Recurrent Neural Network for End-to-End Dynamic Occupancy Grid Mapping

Marcel Schreiber, Vasileios Belagiannis, Claudius Gläser et al.

Dynamic occupancy grid maps are a common approach for modeling the environment of an autonomous vehicle: the surroundings are divided into cells, each containing the occupancy and velocity state of its location. Despite the advantage of modeling arbitrarily shaped objects, the algorithms used rely on hand-designed inverse sensor models, and semantic information is missing. Therefore, we introduce a multi-task recurrent neural network to predict grid maps providing occupancies, velocity estimates, semantic information and the driveable area. During training, our network architecture, which is a combination of convolutional and recurrent layers, processes sequences of raw lidar data, which are represented as bird's eye view images with several height channels. The multi-task network is trained in an end-to-end fashion to predict occupancy grid maps without the usual preprocessing steps consisting of removing ground points and applying an inverse sensor model. In our evaluations, we show that our learned inverse sensor model is able to overcome some limitations of a geometric inverse sensor model in terms of representing object shapes and modeling freespace. Moreover, we report a better runtime performance and more accurate semantic predictions for our end-to-end approach, compared to our network relying on measurement grid maps as input data.

ROJan 12, 2022
Globally Optimal Multi-Scale Monocular Hand-Eye Calibration Using Dual Quaternions

Thomas Wodtko, Markus Horn, Michael Buchholz et al.

In this work, we present an approach for monocular hand-eye calibration from per-sensor ego-motion based on dual quaternions. Due to non-metrically scaled translations of monocular odometry, a scaling factor has to be estimated in addition to the rotation and translation calibration. For this, we derive a quadratically constrained quadratic program that allows a combined estimation of all extrinsic calibration parameters. Using dual quaternions leads to low run-times due to their compact representation. Our problem formulation further allows multiple scalings to be estimated simultaneously for different sequences of the same sensor setup. Based on our problem formulation, we derive both a fast local and a globally optimal solving approach. Finally, our algorithms are evaluated and compared to state-of-the-art approaches on simulated and real-world data, e.g., the EuRoC MAV dataset.

RODec 2, 2021
Situation-Aware Environment Perception Using a Multi-Layer Attention Map

Matti Henning, Johannes Müller, Fabian Gies et al.

Within the field of automated driving, environment perception shows a clear trend towards more sensors, higher redundancy, and an overall increase in computational power. This is mainly driven by the paradigm to perceive the entire environment as well as possible at all times. However, due to the ongoing rise in functional complexity, compromises have to be considered to ensure real-time capabilities of the perception system. In this work, we introduce a concept for situation-aware environment perception to control the resource allocation towards processing relevant areas within the data as well as towards employing only a subset of functional modules for environment perception, if sufficient for the current driving task. Specifically, we propose to evaluate the context of an automated vehicle to derive a multi-layer attention map (MLAM) that defines relevant areas. Using this MLAM, the optimum of active functional modules is dynamically configured and intra-module processing of only relevant data is enforced. We outline the feasibility of application of our concept using real-world data in a straightforward implementation for our system at hand. While retaining overall functionality, we achieve a reduction of accumulated processing time of 59%.

CVSep 20, 2021
Anomaly Detection in Radar Data Using PointNets

Thomas Griebel, Dominik Authaler, Markus Horn et al.

For autonomous driving, radar is an important sensor type. On the one hand, radar offers a direct measurement of the radial velocity of targets in the environment. On the other hand, in the literature, radar sensors are known for their robustness against several kinds of adverse weather conditions. However, on the downside, radar is susceptible to ghost targets or clutter, which can have several different causes, e.g., reflective surfaces in the environment. Ghost targets, for instance, can result in erroneous object detections. To this end, it is desirable to identify anomalous targets as early as possible in radar data. In this work, we present an approach based on PointNets to detect anomalous radar targets. Modifying the PointNet-architecture driven by our task, we developed a novel grouping variant which contributes to a multi-form grouping module. Our method is evaluated on a real-world dataset in urban scenarios and shows promising results for the detection of anomalous radar targets.

CVAug 27, 2021
Fast Rule-Based Clutter Detection in Automotive Radar Data

Johannes Kopp, Dominik Kellner, Aldi Piroli et al.

Automotive radar sensors output a lot of unwanted clutter or ghost detections, whose position and velocity do not correspond to any real object in the sensor's field of view. This poses a substantial challenge for environment perception methods like object detection or tracking. Especially problematic are clutter detections that occur in groups or at similar locations in multiple consecutive measurements. In this paper, a new algorithm for identifying such erroneous detections is presented. It is mainly based on the modeling of specific commonly occurring wave propagation paths that lead to clutter. In particular, the three effects explicitly covered are reflections at the underbody of a car or truck, signals traveling back and forth between the vehicle on which the sensor is mounted and another object, and multipath propagation via specular reflection. The latter often occurs near guardrails, concrete walls or similar reflective surfaces. Each of these effects is described both theoretically and regarding a method for identifying the corresponding clutter detections. Identification is done by analyzing detections generated from a single sensor measurement only. The final algorithm is evaluated on recordings of real extra-urban traffic. For labeling, a semi-automatic process is employed. The results are promising, both in terms of performance and regarding the very low execution time. Typically, a large part of clutter is found, while only a small ratio of detections corresponding to real objects are falsely classified by the algorithm.

CVJul 19, 2021
GenRadar: Self-supervised Probabilistic Camera Synthesis based on Radar Frequencies

Carsten Ditzel, Klaus Dietmayer

Autonomous systems require a continuous and dependable environment perception for navigation and decision-making, which is best achieved by combining different sensor types. Radar continues to function robustly in compromised circumstances in which cameras become impaired, guaranteeing a steady inflow of information. Yet, camera images provide a more intuitive and readily applicable impression of the world. This work combines the complementary strengths of both sensor types in a unique self-learning fusion approach for a probabilistic scene reconstruction in adverse surrounding conditions. After reducing the memory requirements of both high-dimensional measurements through a decoupled stochastic self-supervised compression technique, the proposed algorithm exploits similarities and establishes correspondences between both domains at different feature levels during training. Then, at inference time, relying exclusively on radio frequencies, the model successively predicts camera constituents in an autoregressive and self-contained process. These discrete tokens are finally transformed back into an instructive view of the respective surrounding, allowing potential dangers to be perceived visually for important downstream tasks.

CVJul 16, 2021
Attention-based Vehicle Self-Localization with HD Feature Maps

Nico Engel, Vasileios Belagiannis, Klaus Dietmayer

We present a vehicle self-localization method using point-based deep neural networks. Our approach processes measurements and point features, i.e. landmarks, from a high-definition digital map to infer the vehicle's pose. To learn the best association and incorporate local information between the point sets, we propose an attention mechanism that matches the measurements to the corresponding landmarks. Finally, we use this representation for the point-cloud registration and the subsequent pose regression task. Furthermore, we introduce a training simulation framework that artificially generates measurements and landmarks to facilitate the deployment process and reduce the cost of creating extensive datasets from real-world data. We evaluate our method on our dataset, as well as an adapted version of the KITTI odometry dataset, where we achieve superior performance compared to related approaches and additionally show dominant generalization capabilities.

CVMar 3, 2021
Motion Classification and Height Estimation of Pedestrians Using Sparse Radar Data

Markus Horn, Ole Schumann, Markus Hahn et al.

A complete overview of the surrounding vehicle environment is important for driver assistance systems and highly autonomous driving. Fusing results of multiple sensor types like camera, radar and lidar is crucial for increasing the robustness. The detection and classification of objects like cars, bicycles or pedestrians has been analyzed in the past for many sensor types. Beyond that, it is also helpful to refine these classes and distinguish for example between different pedestrian types or activities. This task is usually performed on camera data, though recent developments are based on radar spectrograms. However, for most automotive radar systems, it is only possible to obtain radar targets instead of the original spectrograms. This work demonstrates that it is possible to estimate the body height of walking pedestrians using 2D radar targets. Furthermore, different pedestrian motion types are classified.

ROFeb 15, 2021
Graph-based Motion Planning for Automated Vehicles using Multi-model Branching and Admissible Heuristics

Oliver Speidel, Jona Ruof, Klaus Dietmayer

Automated driving in urban scenarios requires efficient planning algorithms able to handle complex situations in real-time. A popular approach is to use graph-based planning methods in order to obtain a rough trajectory which is subsequently optimized. A key aspect is the generation of trajectories implementing comfortable and safe behavior already during graph-search while keeping computation times low. To capture this aspect, on the one hand, a branching strategy is presented in this work that leads to better performance in terms of quality of resulting trajectories and runtime. On the other hand, admissible heuristics are shown which guide the graph-search efficiently, where the solution remains optimal.

ROJan 27, 2021
Online Extrinsic Calibration based on Per-Sensor Ego-Motion Using Dual Quaternions

Markus Horn, Thomas Wodtko, Michael Buchholz et al.

In this work, we propose an approach for extrinsic sensor calibration from per-sensor ego-motion estimates. Our problem formulation is based on dual quaternions, enabling two different online capable solving approaches. We provide a certifiable globally optimal and a fast local approach along with a method to verify the globality of the local approach. Additionally, means for integrating previous knowledge, for example, a common ground plane for planar sensor motion, are described. Our algorithms are evaluated on simulated data and on a publicly available dataset containing RGB-D camera images. Further, our online calibration approach is tested on the KITTI odometry dataset, which provides data of a lidar and two stereo camera systems mounted on a vehicle. Our evaluation confirms the short run time, state-of-the-art accuracy, as well as online capability of our approach while retaining the global optimality of the solution at any time.

CVDec 18, 2020
Labels Are Not Perfect: Inferring Spatial Uncertainty in Object Detection

Di Feng, Zining Wang, Yiyang Zhou et al.

The availability of many real-world driving datasets is a key reason behind the recent progress of object detection algorithms in autonomous driving. However, there exists ambiguity or even failures in object labels due to an error-prone annotation process or sensor observation noise. Current public object detection datasets only provide deterministic object labels without considering their inherent uncertainty, as do the common training process and evaluation metrics for object detectors. As a result, an in-depth evaluation among different object detection methods remains challenging, and the training process of object detectors is sub-optimal, especially in probabilistic object detection. In this work, we infer the uncertainty in bounding box labels from LiDAR point clouds based on a generative model, and define a new representation of the probabilistic bounding box through a spatial uncertainty distribution. Comprehensive experiments show that the proposed model reflects complex environmental noise in LiDAR perception and the label quality. Furthermore, we propose Jaccard IoU (JIoU) as a new evaluation metric that extends IoU by incorporating label uncertainty. We conduct an in-depth comparison among several LiDAR-based object detectors using the JIoU metric. Finally, we incorporate the proposed label uncertainty in a loss function to train a probabilistic object detector and to improve its detection accuracy. We verify our proposed methods on two public datasets (KITTI, Waymo), as well as on simulation data. Code is released at https://bit.ly/2W534yo.