SE AI DC PL

Sep 12, 2023

Comparing Llama-2 and GPT-3 LLMs for HPC kernels generation

Pedro Valero-Lara, Alexis Huante, Mustafa Al Lail, William F. Godoy, Keita Teranishi, Prasanna Balaprakash, Jeffrey S. Vetter

arXiv:2309.07103v119.341 citations

h-index: 34

Has Code

Originality: Synthesis-oriented

AI Analysis

This work evaluates generative AI models for code generation in HPC. It is incremental: it extends the authors' previous study of GPT-3 (through OpenAI Codex and GitHub Copilot) to Llama-2.

The study compares the open-source Llama-2 model with GPT-3 (accessed through GitHub Copilot) for generating high-performance computing kernels across several programming models and languages. It finds that Llama-2 shows competitive or superior accuracy. Copilot produces code that is more reliable but less optimized, whereas Llama-2 produces code that is less reliable but more optimized when correct.

We evaluate the use of the open-source Llama-2 model for generating well-known, high-performance computing kernels (e.g., AXPY, GEMV, GEMM) on different parallel programming models and languages (e.g., C++: OpenMP, OpenMP Offload, OpenACC, CUDA, HIP; Fortran: OpenMP, OpenMP Offload, OpenACC; Python: numpy, Numba, pyCUDA, cuPy; and Julia: Threads, CUDA.jl, AMDGPU.jl). We built upon our previous work that is based on the OpenAI Codex, which is a descendant of GPT-3, to generate similar kernels with simple prompts via GitHub Copilot. Our goal is to compare the accuracy of Llama-2 and our original GPT-3 baseline by using a similar metric. Llama-2 has a simplified model that shows competitive or even superior accuracy. We also report on the differences between these foundational large language models as generative AI continues to redefine human-computer interactions. Overall, Copilot generates codes that are more reliable but less optimized, whereas codes generated by Llama-2 are less reliable but more optimized when correct.
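For readers unfamiliar with the kernels named in the abstract, the following is a minimal plain-Python sketch of the standard BLAS definitions of AXPY, GEMV, and GEMM. It is not code from the paper and not output from either model. The function names, the list-of-rows matrix layout, and the choice to return new lists rather than update in place are assumptions made for illustration. It is a readable reference, not an optimized kernel.

```python
# Reference definitions of three BLAS kernels (illustrative only).
# Matrices are lists of rows; all functions return new lists.

def axpy(alpha, x, y):
    """AXPY: return alpha * x + y for equal-length vectors x and y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(alpha, A, x, beta, y):
    """GEMV: return alpha * (A x) + beta * y for an m-by-n matrix A."""
    return [
        alpha * sum(a * xj for a, xj in zip(row, x)) + beta * yi
        for row, yi in zip(A, y)
    ]


def gemm(alpha, A, B, beta, C):
    """GEMM: return alpha * (A B) + beta * C for m-by-k A, k-by-n B, m-by-n C."""
    cols_of_B = list(zip(*B))  # transpose B so each column can be iterated
    return [
        [
            alpha * sum(a * b for a, b in zip(row, col)) + beta * c
            for col, c in zip(cols_of_B, c_row)
        ]
        for row, c_row in zip(A, C)
    ]
```

For example, `axpy(2.0, [1.0, 2.0], [3.0, 4.0])` evaluates to `[5.0, 8.0]`. The parallel versions studied in the paper distribute these same loops across threads or GPU cores using the listed programming models.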
