PF AI AR LG
Sep 13, 2024

Automatic Generation of Fast and Accurate Performance Models for Deep Neural Network Accelerators

Konstantin Lübeck, Alexander Louis-Ferdinand Jung, Felix Wedlich, Mika Markus Müller, Federico Nicolás Peccia, Felix Thömmes, Jannik Steinmetz, Valentin Biermaier, Adrian Frischknecht, Paul Palomero Bernardo, Oliver Bringmann

arXiv:2409.08595v11.2h-index: 4

Originality: Incremental advance

AI Analysis

This work addresses the need for efficient performance modeling for hardware accelerators in edge computing, offering an incremental improvement over existing methods.

The paper tackles the challenge of estimating the latency of deep neural networks on edge accelerators by presenting an automated approach that generates fast performance models. In the best case, the approach evaluates only 154 loop kernel iterations to estimate the performance for 4.19 billion instructions. It also outperforms regression and analytical models in accuracy, measured as mean absolute percentage error (MAPE) against simulation results.

Implementing Deep Neural Networks (DNNs) on resource-constrained edge devices is a challenging task that requires tailored hardware accelerator architectures and a clear understanding of their performance characteristics when executing the intended AI workload. To facilitate this, we present an automated generation approach for fast performance models to accurately estimate the latency of a DNN mapped onto systematically modeled and concisely described accelerator architectures. Using our accelerator architecture description method, we modeled representative DNN accelerators such as Gemmini, UltraTrail, a Plasticine-derived architecture, and a parameterizable systolic array. Together with DNN mappings for those modeled architectures, we perform a combined DNN/hardware dependency graph analysis, which enables us, in the best case, to evaluate only 154 loop kernel iterations to estimate the performance for 4.19 billion instructions, achieving a significant speedup. We outperform regression and analytical models in terms of mean absolute percentage error (MAPE) compared to simulation results, while being several orders of magnitude faster than an RTL simulation.
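The accuracy metric named in the abstract, MAPE, can be illustrated with a minimal sketch. It assumes the standard definition (the mean of absolute errors relative to the reference values, in percent). The function name `mape` and the latency numbers are hypothetical and not taken from the paper.

```python
def mape(estimates, references):
    """Mean absolute percentage error of estimates against reference values, in percent."""
    if len(estimates) != len(references) or not references:
        raise ValueError("inputs must be non-empty and of equal length")
    # Each term is the absolute error relative to the reference (here: simulated) value.
    relative_errors = [abs(e - r) / abs(r) for e, r in zip(estimates, references)]
    return 100.0 * sum(relative_errors) / len(relative_errors)


# Hypothetical per-layer latencies in cycles (illustrative only, not from the paper).
simulated = [1000.0, 2000.0, 4000.0]
estimated = [1100.0, 1900.0, 4000.0]
error_percent = mape(estimated, simulated)
```

In this sketch the simulated latencies play the role of the reference, matching the abstract's comparison against simulation results. A reference value of zero would make the metric undefined and raises a `ZeroDivisionError` here.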
