ViT-BEVSeg: A Hierarchical Transformer Network for Monocular Birds-Eye-View Segmentation
This work addresses near-field perception for self-driving vehicles and autonomous mobile robots. Its contribution is incremental: it replaces the CNN backbones used in earlier BEV segmentation networks with vision transformers.
The paper tackles the problem of generating detailed Bird's-Eye-View (BEV) maps for autonomous vehicles and robotics. It proposes ViT-BEVSeg, a hierarchical transformer network, and reports a considerable improvement in performance on the nuScenes dataset relative to state-of-the-art approaches.
Generating a detailed near-field perceptual model of the environment is an important and challenging problem in both self-driving vehicles and autonomous mobile robotics. A Bird's-Eye-View (BEV) map, which offers a panoptic representation, is a commonly used approach: it provides a simplified 2D representation of the vehicle's surroundings with accurate semantic-level segmentation for many downstream tasks. Current state-of-the-art approaches to generating BEV maps employ a Convolutional Neural Network (CNN) backbone to create feature maps, which are passed through a spatial transformer to project the derived features onto the BEV coordinate frame. In this paper, we evaluate the use of vision transformers (ViT) as a backbone architecture for generating BEV maps. Our network architecture, ViT-BEVSeg, employs standard vision transformers to generate a multi-scale representation of the input image. The resulting representation is then provided as input to a spatial transformer decoder module, which outputs segmentation maps in the BEV grid. We evaluate our approach on the nuScenes dataset, demonstrating a considerable improvement in performance relative to state-of-the-art approaches.
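To make the data flow in the abstract more concrete, the following is a minimal sketch of the shape bookkeeping only: an image is split into non-overlapping patches at several strides (one possible reading of "multi-scale representation"), and the decoder is assumed to write into a metric BEV grid. The function names, strides, channel width, image size, and BEV extent are all hypothetical values chosen for illustration; none of them are taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class FeatureMap:
    """Shape of one scale of a (hypothetical) hierarchical ViT backbone output."""
    stride: int    # downsampling factor relative to the input image
    height: int    # number of patch tokens along the image height
    width: int     # number of patch tokens along the image width
    channels: int  # embedding dimension (assumed, not from the paper)

    @property
    def num_tokens(self) -> int:
        return self.height * self.width


def multiscale_token_grids(
    image_h: int,
    image_w: int,
    strides: Sequence[int] = (8, 16, 32, 64),  # assumed pyramid strides
    channels: int = 256,                       # assumed embedding width
) -> List[FeatureMap]:
    """Return the token-grid shape at each stride for non-overlapping patches."""
    maps = []
    for s in strides:
        if image_h % s or image_w % s:
            raise ValueError(f"image {image_h}x{image_w} not divisible by stride {s}")
        maps.append(FeatureMap(s, image_h // s, image_w // s, channels))
    return maps


def bev_grid_shape(depth_m: float, width_m: float, resolution_m: float) -> Tuple[int, int]:
    """Number of BEV cells (rows, cols) covering depth_m x width_m metres."""
    if resolution_m <= 0:
        raise ValueError("resolution_m must be positive")
    return round(depth_m / resolution_m), round(width_m / resolution_m)


if __name__ == "__main__":
    # Hypothetical input size and BEV extent, for illustration only.
    for fmap in multiscale_token_grids(512, 1024):
        print(fmap.stride, (fmap.height, fmap.width), fmap.num_tokens)
    print(bev_grid_shape(depth_m=50.0, width_m=50.0, resolution_m=0.25))
```

The sketch deliberately stops at shapes: how the spatial transformer decoder maps image-plane tokens into BEV cells is not specified in the abstract, so it is not modeled here.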