Di Feng

h-index27

27papers

2,500citations

Novelty47%

AI Score58

Ranked #13,947 of 201,326 authors (top 7%)#6,399 in CV (top 11%)

27 Papers

LGJul 22, 2024Code

LCA-on-the-Line: Benchmarking Out-of-Distribution Generalization with Class Taxonomies

Jia Shi, Gautam Gare, Jinjin Tian et al.

We tackle the challenge of predicting models' Out-of-Distribution (OOD) performance using in-distribution (ID) measurements without requiring OOD data. Existing evaluations with "Effective Robustness", which use ID accuracy as an indicator of OOD accuracy, encounter limitations when models are trained with diverse supervision and distributions, such as class labels (Vision Models, VMs, on ImageNet) and textual descriptions (Visual-Language Models, VLMs, on LAION). VLMs often generalize better to OOD data than VMs despite having similar or lower ID performance. To improve the prediction of models' OOD performance from ID measurements, we introduce the Lowest Common Ancestor (LCA)-on-the-Line framework. This approach revisits the established concept of LCA distance, which measures the hierarchical distance between labels and predictions within a predefined class hierarchy, such as WordNet. We assess 75 models using ImageNet as the ID dataset and five significantly shifted OOD variants, uncovering a strong linear correlation between ID LCA distance and OOD top-1 accuracy. Our method provides a compelling alternative for understanding why VLMs tend to generalize better. Additionally, we propose a technique to construct a taxonomic hierarchy on any dataset using K-means clustering, demonstrating that LCA distance is robust to the constructed taxonomic hierarchy. Moreover, we demonstrate that aligning model predictions with class taxonomies, through soft labels or prompt engineering, can enhance model generalization. Open source code in our Project Page: https://elvishelvis.github.io/papers/lca/.

CLJan 23Code

PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice

Yuzhen Shi, Huanghai Liu, Yiran Hu et al.

As large language models (LLMs) are increasingly applied to legal domain-specific tasks, evaluating their ability to perform legal work in real-world settings has become essential. However, existing legal benchmarks rely on simplified and highly standardized tasks, failing to capture the ambiguity, complexity, and reasoning demands of real legal practice. Moreover, prior evaluations often adopt coarse, single-dimensional metrics and do not explicitly assess fine-grained legal reasoning. To address these limitations, we introduce PLawBench, a Practical Law Benchmark designed to evaluate LLMs in realistic legal practice scenarios. Grounded in real-world legal workflows, PLawBench models the core processes of legal practitioners through three task categories: public legal consultation, practical case analysis, and legal document generation. These tasks assess a model's ability to identify legal issues and key facts, perform structured legal reasoning, and generate legally coherent documents. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics, resulting in approximately 12,500 rubric items for fine-grained assessment. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs. Experimental results show that none achieves strong performance on PLawBench, revealing substantial limitations in the fine-grained legal reasoning capabilities of current LLMs and highlighting important directions for future evaluation and development of legal LLMs. Data is available at: https://github.com/skylenage/PLawbench.

CVSep 26, 2022

DeepFusion: A Robust and Modular 3D Object Detector for Lidars, Cameras and Radars

Florian Drews, Di Feng, Florian Faion et al.

We propose DeepFusion, a modular multi-modal architecture to fuse lidars, cameras and radars in different combinations for 3D object detection. Specialized feature extractors take advantage of each modality and can be exchanged easily, making the approach simple and flexible. Extracted features are transformed into bird's-eye-view as a common representation for fusion. Spatial and semantic alignment is performed prior to fusing modalities in the feature space. Finally, a detection head exploits rich multi-modal features for improved 3D detection performance. Experimental results for lidar-camera, lidar-camera-radar and camera-radar fusion show the flexibility and effectiveness of our fusion approach. In the process, we study the largely unexplored task of faraway car detection up to 225 meters, showing the benefits of our lidar-camera fusion. Furthermore, we investigate the required density of lidar points for 3D object detection and illustrate implications at the example of robustness against adverse weather conditions. Moreover, ablation studies on our camera-radar fusion highlight the importance of accurate depth estimation.

GTApr 13

The Price of Ignorance: Information-Free Quotation for Data Retention in Machine Unlearning

Bin Han, Di Feng, Zexin Fang et al.

When users exercise data deletion rights under the General Data Protection Regulation (GDPR) and similar regulations, mobile network operators face a tradeoff: excessive machine unlearning degrades model accuracy and incurs retraining costs, yet existing pricing mechanisms for data retention require the server to know every user's private privacy and accuracy preferences, which is infeasible under the very regulations that motivate unlearning. We ask: what is the welfare cost of operating without this private information? We design an information-free ascending quotation mechanism where the server broadcasts progressively higher prices and users self-select their data supply, requiring no knowledge of users' parameters. Under complete information, the protocol admits a unique subgame-perfect Nash equilibrium characterized by single-period selling. We formalize the Price of Ignorance -- the welfare gap between optimal personalized pricing (which knows everything) and our information-free quotation (which knows nothing) -- and prove a three-regime efficiency ordering. Numerical evaluation across seven mechanisms and 5000 Monte Carlo runs shows that this price is near zero: the information-free mechanism achieves >=99% of the welfare of its information-intensive benchmarks, while providing noise-robust guarantees and comparable fairness.

CVApr 21, 2022

Understanding the Domain Gap in LiDAR Object Detection Networks

Jasmine Richter, Florian Faion, Di Feng et al.

In order to make autonomous driving a reality, artificial neural networks have to work reliably in the open-world. However, the open-world is vast and continuously changing, so it is not technically feasible to collect and annotate training datasets which accurately represent this domain. Therefore, there are always domain gaps between training datasets and the open-world which must be understood. In this work, we investigate the domain gaps between high-resolution and low-resolution LiDAR sensors in object detection networks. Using a unique dataset, which enables us to study sensor resolution domain gaps independent of other effects, we show two distinct domain gaps - an inference domain gap and a training domain gap. The inference domain gap is characterised by a strong dependence on the number of LiDAR points per object, while the training gap shows no such dependence. These fndings show that different approaches are required to close these inference and training domain gaps.

CVJan 31, 2023

Priors are Powerful: Improving a Transformer for Multi-camera 3D Detection with 2D Priors

Di Feng, Francesco Ferroni

Transfomer-based approaches advance the recent development of multi-camera 3D detection both in academia and industry. In a vanilla transformer architecture, queries are randomly initialised and optimised for the whole dataset, without considering the differences among input frames. In this work, we propose to leverage the predictions from an image backbone, which is often highly optimised for 2D tasks, as priors to the transformer part of a 3D detection network. The method works by (1). augmenting image feature maps with 2D priors, (2). sampling query locations via ray-casting along 2D box centroids, as well as (3). initialising query features with object-level image features. Experimental results shows that 2D priors not only help the model converge faster, but also largely improve the baseline approach by up to 12% in terms of average precision.

CVOct 13, 2023

Revisiting Multi-modal 3D Semantic Segmentation in Real-world Autonomous Driving

Feng Jiang, Chaoping Tu, Gang Zhang et al.

LiDAR and camera are two critical sensors for multi-modal 3D semantic segmentation and are supposed to be fused efficiently and robustly to promise safety in various real-world scenarios. However, existing multi-modal methods face two key challenges: 1) difficulty with efficient deployment and real-time execution; and 2) drastic performance degradation under weak calibration between LiDAR and cameras. To address these challenges, we propose CPGNet-LCF, a new multi-modal fusion framework extending the LiDAR-only CPGNet. CPGNet-LCF solves the first challenge by inheriting the easy deployment and real-time capabilities of CPGNet. For the second challenge, we introduce a novel weak calibration knowledge distillation strategy during training to improve the robustness against the weak calibration. CPGNet-LCF achieves state-of-the-art performance on the nuScenes and SemanticKITTI benchmarks. Remarkably, it can be easily deployed to run in 20ms per frame on a single Tesla V100 GPU using TensorRT TF16 mode. Furthermore, we benchmark performance over four weak calibration levels, demonstrating the robustness of our proposed approach.

CVNov 23, 2025Code

SO-Bench: A Structural Output Evaluation of Multimodal LLMs

Di Feng, Kaixin Ma, Feng Nan et al.

Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings where outputs must not only be correct, but also conform to predefined data schemas. Despite recent progress in structured generation in textual domain, there is still no benchmark that systematically evaluates schema-grounded information extraction and reasoning over visual inputs. In this work, we conduct a comprehensive study of visual structural output capabilities for MLLMs with our carefully designed SO-Bench benchmark. Covering four visual domains, including UI screens, natural images, documents, and charts, SO-Bench is built from over 6.5K diverse JSON schemas and 1.8K curated image-schema pairs with human-verified quality. Benchmarking experiments on open-sourced and frontier proprietary models reveal persistent gaps in predicting accurate, schema compliant outputs, highlighting the need for better multimodal structured reasoning. Beyond benchmarking, we further conduct training experiments to largely improve the model's structured output capability. We plan to make the benchmark available to the community.

CVMar 6, 2021Code

A Simple and Efficient Multi-task Network for 3D Object Detection and Road Understanding

Di Feng, Yiyang Zhou, Chenfeng Xu et al.

Detecting dynamic objects and predicting static road information such as drivable areas and ground heights are crucial for safe autonomous driving. Previous works studied each perception task separately, and lacked a collective quantitative analysis. In this work, we show that it is possible to perform all perception tasks via a simple and efficient multi-task network. Our proposed network, LidarMTL, takes raw LiDAR point cloud as inputs, and predicts six perception outputs for 3D object detection and road understanding. The network is based on an encoder-decoder architecture with 3D sparse convolution and deconvolution operations. Extensive experiments verify the proposed method with competitive accuracies compared to state-of-the-art object detectors and other task-specific networks. LidarMTL is also leveraged for online localization. Code and pre-trained model have been made available at https://github.com/frankfengdi/LidarMTL.

CVNov 20, 2020Code

A Review and Comparative Study on Probabilistic Object Detection in Autonomous Driving

Di Feng, Ali Harakeh, Steven Waslander et al.

Capturing uncertainty in object detection is indispensable for safe autonomous driving. In recent years, deep learning has become the de-facto approach for object detection, and many probabilistic object detectors have been proposed. However, there is no summary on uncertainty estimation in deep object detection, and existing methods are not only built with different network architectures and uncertainty estimation methods, but also evaluated on different datasets with a wide range of evaluation metrics. As a result, a comparison among methods remains challenging, as does the selection of a model that best suits a particular application. This paper aims to alleviate this problem by providing a review and comparative study on existing probabilistic object detection methods for autonomous driving applications. First, we provide an overview of generic uncertainty estimation in deep learning, and then systematically survey existing methods and evaluation metrics for probabilistic object detection. Next, we present a strict comparative study for probabilistic object detection based on an image detector and three public autonomous driving datasets. Finally, we present a discussion of the remaining challenges and future works. Code has been made available at https://github.com/asharakeh/pod_compare.git

GTMay 8

Incentivizing User Data Contributions for LLM Improvement under Withdrawal Rights

Di Feng, Chenhao Zhang, Zhanzhan Zhao

The continued improvement of large language models (LLMs) increasingly depends on eliciting high-quality, user-generated data, yet such data are costly to provide and often withheld due to privacy and effort concerns. This creates a fundamental design challenge: how to incentivize data contribution when model improvements require coordinated, threshold-level inputs, while contributions remain privately costly and partially reversible. We develop and theoretically analyze incentive mechanisms for user data contribution that explicitly account for threshold effects and reversibility, focusing on how subsidies and withdrawal rights can be jointly designed to overcome coordination failure. As a natural benchmark, we first consider subsidy-based incentives, under which users respond to posted payments with privately optimal floor contributions. These decentralized responses may fall below the improvement threshold, resulting in subsidy expenditure without model improvements. We then analyze mechanisms with withdrawal rights, in which users report costs, the provider centrally assigns contribution burdens, and users may withdraw before training. We prove that combining cost reporting with personalized assignment can eliminate inefficient provision by ensuring that data are collected only when improvement is sustainable, converting infeasible instances into a null outcome rather than subsidy leakage. Finally, we compare two withdrawal protocols. The simultaneous protocol can achieve lower total cost, while the small-first sequential protocol better incentivizes participation, encouraging greater data provision and thereby increasing the probability of crossing the improvement threshold.

CVMar 5, 2024

FastOcc: Accelerating 3D Occupancy Prediction by Fusing the 2D Bird's-Eye View and Perspective View

Jiawei Hou, Xiaoyan Li, Wenhao Guan et al.

In autonomous driving, 3D occupancy prediction outputs voxel-wise status and semantic labels for more comprehensive understandings of 3D scenes compared with traditional perception tasks, such as 3D object detection and bird's-eye view (BEV) semantic segmentation. Recent researchers have extensively explored various aspects of this task, including view transformation techniques, ground-truth label generation, and elaborate network design, aiming to achieve superior performance. However, the inference speed, crucial for running on an autonomous vehicle, is neglected. To this end, a new method, dubbed FastOcc, is proposed. By carefully analyzing the network effect and latency from four parts, including the input image resolution, image backbone, view transformation, and occupancy prediction head, it is found that the occupancy prediction head holds considerable potential for accelerating the model while keeping its accuracy. Targeted at improving this component, the time-consuming 3D convolution network is replaced with a novel residual-like architecture, where features are mainly digested by a lightweight 2D BEV convolution network and compensated by integrating the 3D voxel features interpolated from the original image features. Experiments on the Occ3D-nuScenes benchmark demonstrate that our FastOcc achieves state-of-the-art results with a fast inference speed.

CVOct 24, 2024

Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

Zhangheng Li, Keen You, Haotian Zhang et al. · apple-ml, gatech

Building a generalist model for user interface (UI) understanding is challenging due to various foundational issues, such as platform diversity, resolution variation, and data limitation. In this paper, we introduce Ferret-UI 2, a multimodal large language model (MLLM) designed for universal UI understanding across a wide range of platforms, including iPhone, Android, iPad, Webpage, and AppleTV. Building on the foundation of Ferret-UI, Ferret-UI 2 introduces three key innovations: support for multiple platform types, high-resolution perception through adaptive scaling, and advanced task training data generation powered by GPT-4o with set-of-mark visual prompting. These advancements enable Ferret-UI 2 to perform complex, user-centered interactions, making it highly versatile and adaptable for the expanding diversity of platform ecosystems. Extensive empirical experiments on referring, grounding, user-centric advanced tasks (comprising 9 subtasks $\times$ 5 platforms), GUIDE next-action prediction dataset, and GUI-World multi-platform benchmark demonstrate that Ferret-UI 2 significantly outperforms Ferret-UI, and also shows strong cross-platform transfer capabilities.

LGJul 17, 2025

Apple Intelligence Foundation Language Models: Tech Report 2025

Ethan Li, Anders Boesen Lindbo Larsen, Chen Zhang et al. · apple-ml, cmu

We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: i a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and ii a scalable server model built on a novel Parallel-Track Mixture-of-Experts PT-MoE transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global-local attention to deliver high quality with competitive cost on Apple's Private Cloud Compute platform. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling, licensed corpora, and high-quality synthetic data, then further refined with supervised fine-tuning and reinforcement learning on a new asynchronous platform. The resulting models support several additional languages while understanding images and executing tool calls. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines. A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning, allowing developers to integrate these capabilities with a few lines of code. The latest advancements in Apple Intelligence models are grounded in our Responsible AI approach with safeguards like content filtering and locale-specific evaluation, as well as our commitment to protecting our users' privacy with innovations like Private Cloud Compute.

CVSep 30, 2025

Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

Zhen Yang, Zi-Yi Dou, Di Feng et al.

Developing autonomous agents that effectively interact with Graphic User Interfaces (GUIs) remains a challenging open problem, especially for small on-device models. In this paper, we present Ferret-UI Lite, a compact, end-to-end GUI agent that operates across diverse platforms, including mobile, web, and desktop. Utilizing techniques optimized for developing small models, we build our 3B Ferret-UI Lite agent through curating a diverse GUI data mixture from real and synthetic sources, strengthening inference-time performance through chain-of-thought reasoning and visual tool-use, and reinforcement learning with designed rewards. Ferret-UI Lite achieves competitive performance with other small-scale GUI agents. In GUI grounding, Ferret-UI Lite attains scores of $91.6\%$, $53.3\%$, and $61.2\%$ on the ScreenSpot-V2, ScreenSpot-Pro, and OSWorld-G benchmarks, respectively. For GUI navigation, Ferret-UI Lite achieves success rates of $28.0\%$ on AndroidWorld and $19.8\%$ on OSWorld. We share our methods and lessons learned from developing compact, on-device GUI agents.

LGMar 29, 2025

Buyer-Initiated Auction Mechanism for Data Redemption in Machine Unlearning

Bin Han, Di Feng, Jie Wang et al.

The rapid growth of artificial intelligence (AI) has raised privacy concerns over user data, leading to regulations like the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). With the essential toolbox provided by machine unlearning, AI service providers are now able to remove user data from their trained models as well as the training datasets, so as to comply with such regulations. However, extensive data redemption can be costly and degrade model accuracy. To balance the cost of unlearning and the privacy protection, we propose a buyer-initiated auction mechanism for data redemption, enabling the service provider to purchase data from willing users with appropriate compensation. This approach does not require the server to have any a priori knowledge about the users' privacy preference, and provides an efficient solution for maximizing the social welfare in the investigated problem.

CVDec 18, 2020

Labels Are Not Perfect: Inferring Spatial Uncertainty in Object Detection

Di Feng, Zining Wang, Yiyang Zhou et al.

The availability of many real-world driving datasets is a key reason behind the recent progress of object detection algorithms in autonomous driving. However, there exist ambiguity or even failures in object labels due to error-prone annotation process or sensor observation noise. Current public object detection datasets only provide deterministic object labels without considering their inherent uncertainty, as does the common training process or evaluation metrics for object detectors. As a result, an in-depth evaluation among different object detection methods remains challenging, and the training process of object detectors is sub-optimal, especially in probabilistic object detection. In this work, we infer the uncertainty in bounding box labels from LiDAR point clouds based on a generative model, and define a new representation of the probabilistic bounding box through a spatial uncertainty distribution. Comprehensive experiments show that the proposed model reflects complex environmental noises in LiDAR perception and the label quality. Furthermore, we propose Jaccard IoU (JIoU) as a new evaluation metric that extends IoU by incorporating label uncertainty. We conduct an in-depth comparison among several LiDAR-based object detectors using the JIoU metric. Finally, we incorporate the proposed label uncertainty in a loss function to train a probabilistic object detector and to improve its detection accuracy. We verify our proposed methods on two public datasets (KITTI, Waymo), as well as on simulation data. Code is released at https://bit.ly/2W534yo.

CVAug 10, 2020

Labels Are Not Perfect: Improving Probabilistic Object Detection via Label Uncertainty

Di Feng, Lars Rosenbaum, Fabian Timm et al.

Reliable uncertainty estimation is crucial for robust object detection in autonomous driving. However, previous works on probabilistic object detection either learn predictive probability for bounding box regression in an un-supervised manner, or use simple heuristics to do uncertainty regularization. This leads to unstable training or suboptimal detection performance. In this work, we leverage our previously proposed method for estimating uncertainty inherent in ground truth bounding box parameters (which we call label uncertainty) to improve the detection accuracy of a probabilistic LiDAR-based object detector. Experimental results on the KITTI dataset show that our method surpasses both the baseline model and the models based on simple heuristics by up to 3.6% in terms of Average Precision.

CVApr 16, 2020

Where can I drive? A System Approach: Deep Ego Corridor Estimation for Robust Automated Driving

Thomas Michalke, Di Feng, Claudius Gläser et al.

Lane detection is an essential part of the perception sub-architecture of any automated driving (AD) or advanced driver assistance system (ADAS). When focusing on low-cost, large scale products for automated driving, model-driven approaches for the detection of lane markings have proven good performance. More recently, data-driven approaches have been proposed that target the drivable area / freespace mainly in inner-city applications. Focus of these approaches is less on lane-based driving due to the fact that the lane concept does not fully apply in unstructured, residential inner-city environments. So-far the concept of drivable area is seldom used for highway and inter-urban applications due to the specific requirements of these scenarios that require clear lane associations of all traffic participants. We believe that lane-based, mapless driving in inter-urban and highway scenarios is still not fully handled with sufficient robustness and availability. Especially for challenging weather situations such as heavy rain, fog, low-standing sun, darkness or reflections in puddles, the mapless detection of lane markings decreases significantly or completely fails. We see potential in applying specifically designed data-driven freespace approaches in more lane-based driving applications for highways and inter-urban use. Therefore, we propose to classify specifically a drivable corridor of the ego lane on pixel level with a deep learning approach. Our approach is kept computationally efficient with only 0.66 million parameters allowing its application in large scale products. Thus, we were able to easily integrate into an online AD system of a test vehicle. We demonstrate the performance of our approach under challenging conditions qualitatively and quantitatively in comparison to a state-of-the-art model-driven approach.

CVMar 7, 2020

Inferring Spatial Uncertainty in Object Detection

Zining Wang, Di Feng, Yiyang Zhou et al.

The availability of real-world datasets is the prerequisite for developing object detection methods for autonomous driving. While ambiguity exists in object labels due to error-prone annotation process or sensor observation noises, current object detection datasets only provide deterministic annotations without considering their uncertainty. This precludes an in-depth evaluation among different object detection methods, especially for those that explicitly model predictive probability. In this work, we propose a generative model to estimate bounding box label uncertainties from LiDAR point clouds, and define a new representation of the probabilistic bounding box through spatial distribution. Comprehensive experiments show that the proposed model represents uncertainties commonly seen in driving scenarios. Based on the spatial distribution, we further propose an extension of IoU, called the Jaccard IoU (JIoU), as a new evaluation metric that incorporates label uncertainty. Experiments on the KITTI and the Waymo Open Datasets show that JIoU is superior to IoU when evaluating probabilistic object detectors.

ROFeb 1, 2020

Leveraging Uncertainties for Deep Multi-modal Object Detection in Autonomous Driving

Di Feng, Yifan Cao, Lars Rosenbaum et al.

This work presents a probabilistic deep neural network that combines LiDAR point clouds and RGB camera images for robust, accurate 3D object detection. We explicitly model uncertainties in the classification and regression tasks, and leverage uncertainties to train the fusion network via a sampling mechanism. We validate our method on three datasets with challenging real-world driving scenarios. Experimental results show that the predicted uncertainties reflect complex environmental uncertainty like difficulties of a human expert to label objects. The results also show that our method consistently improves the Average Precision by up to 7% compared to the baseline method. When sensors are temporally misaligned, the sampling method improves the Average Precision by up to 20%, showing its high robustness against noisy sensor inputs.

ROSep 26, 2019

Can We Trust You? On Calibration of a Probabilistic Object Detector for Autonomous Driving

Di Feng, Lars Rosenbaum, Claudius Glaeser et al.

Reliable uncertainty estimation is crucial for perception systems in safe autonomous driving. Recently, many methods have been proposed to model uncertainties in deep learning based object detectors. However, the estimated probabilities are often uncalibrated, which may lead to severe problems in safety critical scenarios. In this work, we identify such uncertainty miscalibration problems in a probabilistic LiDAR 3D object detection network, and propose three practical methods to significantly reduce errors in uncertainty calibration. Extensive experiments on several datasets show that our methods produce well-calibrated uncertainties, and generalize well between different datasets.

ROFeb 21, 2019

Deep Multi-modal Object Detection and Semantic Segmentation for Autonomous Driving: Datasets, Methods, and Challenges

Di Feng, Christian Haase-Schütz, Lars Rosenbaum et al.

Recent advancements in perception for autonomous driving are driven by deep learning. In order to achieve robust and accurate scene understanding, autonomous vehicles are usually equipped with different sensors (e.g. cameras, LiDARs, Radars), and multiple sensing modalities can be fused to exploit their complementary properties. In this context, many methods have been proposed for deep multi-modal perception problems. However, there is no general guideline for network architecture design, and questions of "what to fuse", "when to fuse", and "how to fuse" remain open. This review paper attempts to systematically summarize methodologies and discuss challenges for deep multi-modal object detection and semantic segmentation in autonomous driving. To this end, we first provide an overview of on-board sensors on test vehicles, open datasets, and background information for object detection and semantic segmentation in autonomous driving research. We then summarize the fusion methodologies and discuss challenges and open questions. In the appendix, we provide tables that summarize topics and methods. We also provide an interactive online platform to navigate each reference: https://boschresearch.github.io/multimodalperception/.

ROJan 29, 2019

Deep Active Learning for Efficient Training of a LiDAR 3D Object Detector

Di Feng, Xiao Wei, Lars Rosenbaum et al.

Training a deep object detector for autonomous driving requires a huge amount of labeled data. While recording data via on-board sensors such as camera or LiDAR is relatively easy, annotating data is very tedious and time-consuming, especially when dealing with 3D LiDAR points or radar data. Active learning has the potential to minimize human annotation efforts while maximizing the object detector's performance. In this work, we propose an active learning method to train a LiDAR 3D object detector with the least amount of labeled training data necessary. The detector leverages 2D region proposals generated from the RGB images to reduce the search space of objects and speed up the learning process. Experiments show that our proposed method works under different uncertainty estimations and query functions, and can save up to 60% of the labeling efforts while reaching the same network performance.

ROSep 14, 2018

Leveraging Heteroscedastic Aleatoric Uncertainties for Robust Real-Time LiDAR 3D Object Detection

Di Feng, Lars Rosenbaum, Fabian Timm et al.

We present a robust real-time LiDAR 3D object detector that leverages heteroscedastic aleatoric uncertainties to significantly improve its detection performance. A multi-loss function is designed to incorporate uncertainty estimations predicted by auxiliary output layers. Using our proposed method, the network ignores to train from noisy samples, and focuses more on informative ones. We validate our method on the KITTI object detection benchmark. Our method surpasses the baseline method which does not explicitly estimate uncertainties by up to nearly 9% in terms of Average Precision (AP). It also produces state-of-the-art results compared to other methods while running with an inference time of only 72 ms. In addition, we conduct extensive experiments to understand how aleatoric uncertainties behave. Extracting aleatoric uncertainties brings almost no additional computation cost during the deployment, making our method highly desirable for autonomous driving applications.

ROJul 3, 2018

Leveraging Robotic Prior Tactile Exploratory Action Experiences For Learning New Objects's Physical Properties

Di Feng

Reusing the tactile knowledge of some previously-explored objects helps us humans to easily recognize the tactual properties of new objects. In this master thesis, we enable arobotic arm equipped with multi-modal artificial skin, like humans, to actively transfer the prior tactile exploratory action experiences when it learns the detailed physical properties of new objects. These prior tactile experiences are built when the robot applies the pressing, sliding and static contact movements on objects with different action parameters and perceives the tactile feedbacks from multiple sensory modalities. Our method was systematically evaluated by several experiments. Results show that the robot could consistently improve the discrimination accuracy by over 10% when it exploited the prior tactile knowledge compared with using no transfer method, and 25% when it used only one training sample. The results also show that the proposed method was robust against transferring irrelevant prior tactile knowledge.

ROApr 13, 2018

Towards Safe Autonomous Driving: Capture Uncertainty in the Deep Neural Network For Lidar 3D Vehicle Detection

Di Feng, Lars Rosenbaum, Klaus Dietmayer

To assure that an autonomous car is driving safely on public roads, its object detection module should not only work correctly, but show its prediction confidence as well. Previous object detectors driven by deep learning do not explicitly model uncertainties in the neural network. We tackle with this problem by presenting practical methods to capture uncertainties in a 3D vehicle detector for Lidar point clouds. The proposed probabilistic detector represents reliable epistemic uncertainty and aleatoric uncertainty in classification and localization tasks. Experimental results show that the epistemic uncertainty is related to the detection accuracy, whereas the aleatoric uncertainty is influenced by vehicle distance and occlusion. The results also show that we can improve the detection performance by 1%-5% by modeling the aleatoric uncertainty.