LG · Mar 19

DAPA: Distribution Aware Piecewise Activation Functions for On-Device Transformer Inference and Training

arXiv:2603.19338 · 42.0 · h-index: 6

Predicted impact: top 58% in LG · last 90 days · Originality: Incremental advance

AI Analysis

This work addresses hardware efficiency for on-device AI, offering incremental improvements in activation functions for Transformers.

The paper tackles inefficient non-linear activation functions in on-device Transformer inference and training by proposing DAPA, a distribution-aware piecewise activation function. In the authors' HLS implementation, DAPA achieves a 16× speedup in GELU computation and a 16× reduction in DSP utilization while maintaining comparable or better model performance.

Non-linear activation functions play a pivotal role in on-device inference and training: they consume substantial hardware resources and significantly affect system performance and energy efficiency. In this work, we propose Distribution-Aware Piecewise Activation (DAPA), a differentiable and hardware-friendly activation function for Transformer architectures that exploits the distribution of pre-activation data. DAPA employs a non-uniform piecewise approximation that allocates finer segments to high-probability regions of the distribution, improving generalizability over prior piecewise linear methods. The resulting approximation is further quantized using Distribution-Weighted Mean Square Error to reduce latency and resource utilization for hardware deployment. Our HLS implementation demonstrates that DAPA speeds up GELU computation by 16× and decreases DSP utilization by 16× while maintaining comparable or better performance across vision Transformers and GPT-2 models.
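The idea of placing finer segments where pre-activations are most likely can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes quantile-based knot placement, standard-normal samples as a stand-in for real pre-activation data, linear interpolation between knots, and a simple fixed-point rounding of the knot outputs. All function names (`quantile_knots`, `uniform_knots`, `quantize`, `distribution_weighted_mse`) are hypothetical. The distribution-weighted error is estimated as the mean squared error over samples drawn from the assumed distribution.

```python
import bisect
import math
import random


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def quantile_knots(samples, num_segments):
    """Assumed placement rule: knots at equally spaced sample quantiles,
    so dense regions of the data receive shorter segments."""
    s = sorted(samples)
    last = len(s) - 1
    picks = [s[round(i * last / num_segments)] for i in range(num_segments + 1)]
    knots = [picks[0]]
    for p in picks[1:]:
        if p > knots[-1]:  # keep knots strictly increasing
            knots.append(p)
    return knots


def uniform_knots(lo, hi, num_segments):
    """Baseline: equally wide segments over [lo, hi]."""
    step = (hi - lo) / num_segments
    return [lo + i * step for i in range(num_segments + 1)]


def quantize(value, frac_bits):
    """Round to a fixed-point grid with frac_bits fractional bits (assumed format)."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def piecewise_linear(x, knots, values):
    """Linear interpolation between knots; inputs outside the knot range
    are extrapolated along the first or last segment."""
    i = bisect.bisect_right(knots, x) - 1
    i = min(max(i, 0), len(knots) - 2)
    x0, x1 = knots[i], knots[i + 1]
    t = (x - x0) / (x1 - x0)
    return values[i] + t * (values[i + 1] - values[i])


def distribution_weighted_mse(samples, knots, values):
    """Monte Carlo estimate of the density-weighted squared error:
    averaging over samples weights each region by its probability."""
    total = sum((piecewise_linear(x, knots, values) - gelu(x)) ** 2 for x in samples)
    return total / len(samples)


random.seed(0)
# Assumption: standard-normal samples stand in for measured pre-activations.
samples = [random.gauss(0.0, 1.0) for _ in range(10000)]
segments = 8

q_knots = quantile_knots(samples, segments)
u_knots = uniform_knots(min(samples), max(samples), segments)

q_err = distribution_weighted_mse(samples, q_knots, [gelu(k) for k in q_knots])
u_err = distribution_weighted_mse(samples, u_knots, [gelu(k) for k in u_knots])
# Same non-uniform knots, but knot outputs rounded to 8 fractional bits.
q_err_fixed = distribution_weighted_mse(
    samples, q_knots, [quantize(gelu(k), 8) for k in q_knots]
)

print(f"non-uniform (quantile) knots: {q_err:.3e}")
print(f"uniform knots:                {u_err:.3e}")
print(f"non-uniform, 8-bit outputs:   {q_err_fixed:.3e}")
```

With the same segment budget, the quantile-based knots should give a lower distribution-weighted error than uniform knots under these assumptions, because most of the error mass of the uniform layout sits near zero where the samples concentrate. The same error metric can then be used to compare candidate bit widths for the quantized table.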
