Low Latency Transformer Inference on FPGAs for Physics Applications with hls4ml
The work enables low-latency, real-time transformer inference for physics applications such as high energy physics and LIGO, though the contribution is incremental: it adapts existing methods to new hardware.
The study tackled efficient transformer inference on FPGAs using hls4ml, achieving latencies under 2 microseconds for real-time physics applications.
This study presents an efficient implementation of transformer architectures on Field-Programmable Gate Arrays (FPGAs) using hls4ml. We describe a strategy for implementing the multi-head attention, softmax, and normalization layers, and evaluate three distinct models. Deployed on a VU13P FPGA chip, the models achieved latencies below 2 μs, demonstrating the potential for real-time applications. The compatibility of hls4ml with any TensorFlow-built transformer model further enhances the scalability and applicability of this work.
Index Terms: FPGAs, machine learning, transformers, high energy physics, LIGO
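As a rough software reference for the three building blocks named in the abstract (multi-head attention, softmax, and layer normalization), the following pure-Python sketch shows the arithmetic involved. It is not the hls4ml implementation and not the authors' code: it uses floating point rather than fixed point, identity mappings stand in for the learned query, key, value, and output projections, and all function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(row, eps=1e-5):
    """Normalize one token vector to zero mean and (near) unit variance.

    Assumption: no learned scale or shift parameters.
    """
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(r, col)) for col in zip(*b)] for r in a]


def attention_head(q, k, v):
    """Scaled dot-product attention for one head: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head_self_attention(x, num_heads):
    """Split features into heads, attend per head, concatenate, add the
    residual, and apply layer normalization.

    Assumption: identity projections replace the learned weight matrices.
    """
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        part = [row[h * d_head:(h + 1) * d_head] for row in x]
        heads.append(attention_head(part, part, part))
    # Concatenate the per-head outputs token by token.
    concat = [sum((head[i] for head in heads), []) for i in range(len(x))]
    # Residual connection followed by layer normalization.
    return [layer_norm([a + b for a, b in zip(xr, cr)])
            for xr, cr in zip(x, concat)]


# Toy input: 3 tokens with 4 features each, processed with 2 heads.
tokens = [[1.0, 0.0, 0.5, -0.5],
          [0.0, 1.0, -0.5, 0.5],
          [0.5, 0.5, 0.0, 0.0]]
output = multi_head_self_attention(tokens, num_heads=2)
```

On an FPGA the same operations would typically be expressed in fixed-point arithmetic, with the exponential and the division handled in whatever way the target toolchain supports; this sketch only fixes the reference behavior that such an implementation would approximate.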