CV · Jul 18, 2023

Light-Weight Vision Transformer with Parallel Local and Global Self-Attention

arXiv:2307.09120v1 · 3 citations · h-index: 50
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of real-time performance in autonomous driving tasks by making Vision Transformers more efficient. The contribution is incremental, since it builds on an existing architecture (PLG-ViT).

The paper tackles the problem of deploying Vision Transformers on resource-limited hardware for autonomous driving by redesigning PLG-ViT to be more compact. The redesign reduces the model's size by a factor of 5 with a moderate performance drop and achieves 79.5% top-1 accuracy on ImageNet-1K with 5 million parameters.

While transformer architectures have dominated computer vision in recent years, these models cannot easily be deployed on hardware with limited resources for autonomous driving tasks that require real-time performance. Their computational complexity and memory requirements limit their use, especially for applications with high-resolution inputs. In our work, we redesign the powerful state-of-the-art Vision Transformer PLG-ViT into a much more compact and efficient architecture that is suitable for such tasks. We identify computationally expensive blocks in the original PLG-ViT architecture and propose several redesigns aimed at reducing the number of parameters and floating-point operations. As a result of our redesign, we are able to reduce PLG-ViT in size by a factor of 5, with a moderate drop in performance. We propose two variants, one optimized for the best trade-off between parameter count and runtime, the other for the best trade-off between parameter count and accuracy. With only 5 million parameters, we achieve 79.5% top-1 accuracy on the ImageNet-1K classification benchmark. Our networks demonstrate great performance on general vision benchmarks like COCO instance segmentation. In addition, we conduct a series of experiments demonstrating the potential of our approach in solving various tasks specifically tailored to the challenges of autonomous driving and transportation.
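
The abstract's point about high-resolution inputs can be illustrated with a rough cost estimate. The sketch below is not taken from the paper. It uses the commonly cited operation counts for global and window-based (local) multi-head self-attention as given in the Swin Transformer paper, and the channel width (96) and window size (7) are illustrative assumptions, not PLG-ViT settings.

```python
def global_attention_cost(h, w, c):
    """Approximate operation count for global self-attention on an h x w token map with c channels.

    4*n*c^2 covers the Q/K/V and output projections; 2*n^2*c covers the
    attention matrix and its product with V (n = h*w tokens).
    """
    n = h * w
    return 4 * n * c * c + 2 * n * n * c


def window_attention_cost(h, w, c, window):
    """Approximate operation count when attention is restricted to window x window blocks."""
    n = h * w
    return 4 * n * c * c + 2 * window * window * n * c


if __name__ == "__main__":
    channels = 96   # assumed channel width, for illustration only
    window = 7      # assumed local window size, for illustration only
    for side in (14, 56, 112):
        g = global_attention_cost(side, side, channels)
        l = window_attention_cost(side, side, channels, window)
        print(f"{side}x{side} tokens: global/local cost ratio = {g / l:.1f}")
```

The global term grows quadratically with the number of tokens while the windowed term grows linearly, which is one reason global attention becomes expensive at high resolution.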
