CV LG IV

Aug 10, 2022

Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization

Zhengang Li, Mengshu Sun, Alec Lu, Haoyu Ma, Geng Yuan, Yanyue Xie, Hao Tang, Yanyu Li, Miriam Leeser, Zhangyang Wang, Xue Lin, Zhenman Fang

Meta AI

arXiv:2208.05163v1

15.677 citations

h-index: 81

Originality: Incremental advance

AI Analysis

This work addresses the hardware acceleration bottleneck for ViTs in computer vision, offering a domain-specific solution for FPGA deployment.

The paper tackles the challenge of accelerating Vision Transformers (ViTs) on FPGAs by proposing an automatic framework built on mixed-scheme quantization. Under the same bit-width, it reports 0.47% to 1.36% higher Top-1 accuracy than algorithm-only ViT quantization, and a roughly 5.6x frame rate improvement over a 32-bit floating-point FPGA baseline (56.8 FPS vs. 10.0 FPS) with a 0.71% accuracy drop on ImageNet for DeiT-base.

Vision transformers (ViTs) are emerging with significantly improved accuracy in computer vision tasks. However, their complex architecture and enormous computation/storage demand impose urgent needs for new hardware accelerator design methodology. This work proposes an FPGA-aware automatic ViT acceleration framework based on the proposed mixed-scheme quantization. To the best of our knowledge, this is the first FPGA-based ViT acceleration framework exploring model quantization. Compared with state-of-the-art ViT quantization work (algorithmic approach only, without hardware acceleration), our quantization achieves 0.47% to 1.36% higher Top-1 accuracy under the same bit-width. Compared with the 32-bit floating-point baseline FPGA accelerator, our accelerator achieves around 5.6x improvement in frame rate (i.e., 56.8 FPS vs. 10.0 FPS) with a 0.71% accuracy drop on the ImageNet dataset for DeiT-base.
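The abstract does not spell out which quantization schemes are mixed. As a rough illustration only, the sketch below assumes that "mixed-scheme" means quantizing some weight rows with a uniform fixed-point grid and the rest with power-of-two levels, which is how mixed-scheme quantization is commonly described in earlier FPGA work. All function names, the `pot_ratio` parameter, and the row-wise split are hypothetical and are not taken from the paper.

```python
import math


def quantize_fixed_point(w, bits, scale):
    """Snap w to the nearest level of a uniform symmetric grid over [-scale, scale]."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit signed values
    q = round(w / scale * qmax)           # nearest integer code
    q = max(-qmax, min(qmax, q))          # clamp to the representable range
    return q * scale / qmax


def quantize_power_of_two(w, bits, scale):
    """Snap |w| to the nearest level in {0} or {scale * 2**-k}, keeping the sign."""
    n_exponents = 2 ** (bits - 1) - 1     # one magnitude code is reserved for zero
    levels = [0.0] + [scale * 2.0 ** -k for k in range(n_exponents)]
    nearest = min(levels, key=lambda level: abs(abs(w) - level))
    return math.copysign(nearest, w)


def quantize_mixed(weights, bits=4, pot_ratio=0.5):
    """Quantize the first pot_ratio share of rows with power-of-two levels,
    and the remaining rows with fixed-point levels (assumed row-wise split)."""
    n_pot = int(len(weights) * pot_ratio)
    quantized = []
    for i, row in enumerate(weights):
        scale = max(abs(w) for w in row) or 1.0   # per-row scale; avoid zero scale
        scheme = quantize_power_of_two if i < n_pot else quantize_fixed_point
        quantized.append([scheme(w, bits, scale) for w in row])
    return quantized


if __name__ == "__main__":
    example = [[0.5, -0.3, 0.1, 1.0], [0.5, -0.3, 0.1, 1.0]]
    for row in quantize_mixed(example, bits=4, pot_ratio=0.5):
        print(row)
```

Under this assumption, the appeal of power-of-two levels on an FPGA is that multiplying by them reduces to a bit shift, while fixed-point rows use conventional multipliers. The share of rows assigned to each scheme would then be a hardware-dependent tuning knob.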
