CV | AI | RO | Aug 10, 2020

Driving among Flatmobiles: Bird-Eye-View occupancy grids from a monocular camera for holistic trajectory planning

arXiv:2008.04047v1 | 33 citations
AI Analysis

This work addresses the need for more interpretable and accurate autonomous driving systems, but it is incremental as it builds on existing intermediate representation approaches.

The paper tackles the problem of interpretability and accuracy in camera-based end-to-end driving networks by introducing a monocular, camera-only holistic trajectory planning network whose intermediate representation is a set of binary Occupancy Grid Maps (OGMs) in Bird-Eye-View (BEV). To predict the OGMs from camera images, the method first predicts semantic masks in camera view and then warps them into BEV with a homography. The masks contain only object footprints, so the warp respects the flat-world hypothesis.

Camera-based end-to-end driving neural networks bring the promise of a low-cost system that maps camera images to driving control commands. These networks are appealing because they replace laborious hand-engineered building blocks, but their black-box nature makes them difficult to delve into in case of failure. Recent works have shown the importance of using an explicit intermediate representation, which has the benefit of increasing both the interpretability and the accuracy of networks' decisions. Nonetheless, these camera-based networks reason in camera view, where scale is not homogeneous and which is hence not directly suitable for motion forecasting. In this paper, we introduce a novel monocular camera-only holistic end-to-end trajectory planning network with a Bird-Eye-View (BEV) intermediate representation that comes in the form of binary Occupancy Grid Maps (OGMs). To ease the prediction of OGMs in BEV from camera images, we introduce a novel scheme where the OGMs are first predicted as semantic masks in camera view and then warped into BEV using the homography between the two planes. The key element that allows this transformation to be applied to 3D objects such as vehicles is predicting solely their footprint in camera view, hence respecting the flat-world hypothesis implied by the homography.
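
The warping step described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the pinhole camera with zero pitch and roll, the grid extents, and the nearest-neighbour sampling are all assumptions made for illustration. For each BEV cell, the code projects the cell centre onto the image through a ground-to-image homography and copies the value of the footprint mask found there.

```python
import math


def flat_ground_homography(f, cx, cy, cam_height):
    """Homography from ground-plane points (x, y, 1) to homogeneous pixels.

    Hypothetical helper. Assumes a pinhole camera with focal length f (pixels)
    and principal point (cx, cy), mounted cam_height metres above a flat road
    with zero pitch and roll. Ground axes: x forward, y to the left.
    """
    # A ground point (x, y, 0) has camera coordinates
    # (right, down, forward) = (-y, cam_height, x), hence
    #   u = cx + f * (-y) / x   and   v = cy + f * cam_height / x.
    return [
        [cx, -f, 0.0],
        [cy, 0.0, f * cam_height],
        [1.0, 0.0, 0.0],
    ]


def warp_mask_to_bev(mask, H, x_range, y_range, cell):
    """Warp a binary camera-view footprint mask into a BEV occupancy grid.

    mask: list of rows of 0/1 values, indexed mask[v][u].
    H: 3x3 homography mapping ground points (x, y, 1) to pixels (u, v, w).
    x_range, y_range: (min, max) grid extents in metres.
    cell: cell size in metres.
    Returns grid[i][j], with i indexing forward distance and j lateral offset.
    """
    height, width = len(mask), len(mask[0])
    n_rows = int(round((x_range[1] - x_range[0]) / cell))
    n_cols = int(round((y_range[1] - y_range[0]) / cell))
    grid = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            # Ground-plane coordinates of the cell centre.
            x = x_range[0] + (i + 0.5) * cell
            y = y_range[0] + (j + 0.5) * cell
            u_h = H[0][0] * x + H[0][1] * y + H[0][2]
            v_h = H[1][0] * x + H[1][1] * y + H[1][2]
            w = H[2][0] * x + H[2][1] * y + H[2][2]
            if w <= 0:
                continue  # point is not in front of the camera
            # Nearest-neighbour lookup in the camera-view mask.
            u = int(math.floor(u_h / w + 0.5))
            v = int(math.floor(v_h / w + 0.5))
            if 0 <= u < width and 0 <= v < height:
                grid[i][j] = mask[v][u]
    return grid


# Toy example: a 100x100 mask with one rectangular vehicle footprint.
mask = [[0] * 100 for _ in range(100)]
for v in range(68, 73):
    for u in range(40, 61):
        mask[v][u] = 1

H = flat_ground_homography(f=100.0, cx=50.0, cy=50.0, cam_height=1.0)
bev = warp_mask_to_bev(mask, H, x_range=(0.0, 10.0), y_range=(-2.0, 2.0), cell=0.5)
```

Because the homography is only valid for points on the ground plane, the sketch gives a meaningful result only when the mask marks where an object touches the road. Pixels belonging to the upper body of a vehicle would be smeared away from the camera in BEV, which is why the abstract stresses predicting footprints alone.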

