Vectorized FlashAttention with Low-cost Exponential Computation in RISC-V Vector Processors
This work targets the computational bottleneck of attention operations in AI models and is aimed at hardware developers. Its contribution is incremental, since it builds on the existing FlashAttention algorithm.
This paper tackles the acceleration of attention kernels in machine learning models by vectorizing the FlashAttention algorithm for RISC-V vector processors, using a low-cost exponential approximation to reduce computational complexity without custom instructions. Experimental results show significant performance gains in processing attention layers.
Attention is a core operation in numerous machine learning and artificial intelligence models. This work focuses on accelerating the attention kernel using the FlashAttention algorithm on vector processors, particularly those based on the RISC-V instruction set architecture (ISA). It represents the first effort to vectorize FlashAttention, minimizing scalar code and reducing the cost of evaluating the exponentials required by the softmax used in attention. By using a low-cost approximation for exponentials in floating-point arithmetic, we reduce the cost of computing the exponential function without extending the baseline vector ISA with new custom instructions. In addition, we explore tiling strategies that improve memory locality. Experimental results highlight the scalability of our approach, demonstrating significant performance gains for the vectorized implementations when processing attention layers in practical applications.
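
To make the two ideas in the abstract concrete, the following is a minimal scalar sketch in Python, not the paper's vectorized RISC-V implementation. The abstract does not say which exponential approximation is used, so the sketch assumes a Schraudolph-style bit-manipulation approximation as a stand-in. The function names (fast_exp, attention_row, attention_reference) and the tile parameter are hypothetical and chosen for illustration only. The tiled loop follows the online-softmax recurrence that FlashAttention is built on: a running maximum and running sum are updated per tile, and earlier partial results are rescaled when the maximum changes.

```python
import math
import struct


def fast_exp(x: float) -> float:
    """Approximate e**x by building a float32 bit pattern (Schraudolph-style).

    Assumption: this is an illustrative stand-in, not the paper's method.
    """
    # Clamp so the constructed pattern is a positive, finite, normal float32.
    x = min(max(x, -87.0), 88.0)
    # 12102203 ~= 2**23 / ln(2) moves x into the exponent field;
    # 1064866805 = 127 * 2**23 minus a small offset that balances the error.
    bits = int(x * 12102203.0 + 1064866805.0)
    return struct.unpack("<f", struct.pack("<i", bits))[0]


def attention_row(q, keys, values, tile=2, exp=fast_exp):
    """Attention output for one query row, processed tile by tile.

    Uses the online-softmax recurrence: keep a running max `m`, a running
    sum `l`, and an unnormalized accumulator `acc`.
    """
    scale = 1.0 / math.sqrt(len(q))
    m = -math.inf
    l = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        m_new = max(m, max(scores))
        # Rescale what was accumulated under the old maximum.
        corr = 0.0 if m == -math.inf else exp(m - m_new)
        l *= corr
        acc = [a * corr for a in acc]
        for s, v in zip(scores, v_tile):
            p = exp(s - m_new)  # argument is always <= 0
            l += p
            acc = [a + p * vj for a, vj in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]


def attention_reference(q, keys, values):
    """Untiled softmax attention with math.exp, for comparison."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    width = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(width)]
```

Calling attention_row with exp=math.exp should reproduce attention_reference up to floating-point rounding for any tile size, which isolates the tiling from the approximation. With the default fast_exp, each exponential is only accurate to within a few percent, but the same approximate factors scale both the accumulator and the running sum, so the result remains a normalized weighted average of the value rows.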