AR · AI · LG · SY · Sep 21, 2024

ProTEA: Programmable Transformer Encoder Acceleration on FPGA

arXiv:2409.13975v1 · 2 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient hardware acceleration for transformer encoders, offering a programmable solution that improves performance for AI applications, though it is incremental in optimizing existing methods.

The paper tackles the lack of flexible hardware accelerators for transformer neural networks by introducing ProTEA, a runtime programmable FPGA accelerator for dense computations, which achieves up to 2.8x speedup over existing FPGA accelerators and is 2.5x faster than a GPU.

Transformer neural networks (TNNs) have been widely used in a diverse range of applications, including natural language processing (NLP), machine translation, and computer vision (CV). Their widespread adoption has been driven primarily by the exceptional performance of their multi-head self-attention block, which extracts key features from sequential data. The multi-head self-attention block is followed by feedforward neural networks, which play a crucial role in introducing non-linearity to help the model learn complex patterns. Despite the popularity of TNNs, only a limited number of hardware accelerators target these two critical blocks. Most prior works have concentrated on sparse architectures that are not flexible enough for popular TNN variants. This paper introduces ProTEA, a runtime programmable accelerator tailored for the dense computations of most state-of-the-art transformer encoders. ProTEA is designed to reduce latency by maximizing parallelism. We introduce an efficient tiling of large matrices that can distribute memory and computing resources across different hardware components within the FPGA. We provide runtime evaluations of ProTEA on a Xilinx Alveo U55C high-performance data center accelerator card. Experimental results demonstrate that ProTEA can host a wide range of popular transformer networks and achieve near-optimal performance with a tile size of 64 in the multi-head self-attention block and 6 in the feedforward networks block when configured with 8 parallel attention heads, 12 layers, and an embedding dimension of 768 on the U55C. Comparative results show that ProTEA is 2.5× faster than an NVIDIA Titan XP GPU. Results also show that it achieves a 1.3× to 2.8× speedup compared with current state-of-the-art custom-designed FPGA accelerators.
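
The matrix tiling mentioned in the abstract can be illustrated in software. The following is a minimal Python sketch of generic blocked matrix multiplication, not ProTEA's hardware design: the function names `naive_matmul` and `tiled_matmul`, the loop ordering, and the use of a single square tile size are all assumptions made for illustration. The abstract states only the tile sizes (64 and 6) and the model configuration.

```python
def naive_matmul(a, b):
    """Reference product of two matrices stored as lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def tiled_matmul(a, b, tile):
    """Multiply a (n x k) by b (k x m) one tile-sized block at a time.

    Hypothetical illustration: each (i0, j0, p0) block is the unit of work
    that an accelerator could hold in on-chip buffers and process in
    parallel. Here the block is simply computed sequentially.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):            # tile rows of the output
        for j0 in range(0, m, tile):        # tile columns of the output
            for p0 in range(0, k, tile):    # tiles along the shared dimension
                # min(...) handles edge tiles when a dimension is not a
                # multiple of the tile size.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

Tiling changes only the order in which partial products are accumulated, so for integer inputs the result matches the untiled product exactly. As a rough sense of scale, if a tile size of 64 were applied along the 768-wide embedding dimension, that dimension would split into 12 tiles; whether ProTEA tiles along that dimension is not stated in the abstract.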

