Comparing Llama-2 and GPT-3 LLMs for HPC kernels generation
This work addresses the problem of evaluating generative AI models for code generation in HPC, but it is incremental as it builds upon previous research with GPT-3.
The study compared the open-source Llama-2 model with GPT-3 (accessed through GitHub Copilot) for generating high-performance computing kernels across various programming models and languages. It found that Llama-2 showed competitive or superior accuracy. Copilot produced code that was more reliable but less optimized, whereas Llama-2 produced code that was less reliable but more optimized when correct.
We evaluate the use of the open-source Llama-2 model for generating well-known, high-performance computing kernels (e.g., AXPY, GEMV, GEMM) on different parallel programming models and languages (e.g., C++: OpenMP, OpenMP Offload, OpenACC, CUDA, HIP; Fortran: OpenMP, OpenMP Offload, OpenACC; Python: NumPy, Numba, PyCUDA, CuPy; and Julia: Threads, CUDA.jl, AMDGPU.jl). We build upon our previous work, which used OpenAI Codex, a descendant of GPT-3, to generate similar kernels with simple prompts via GitHub Copilot. Our goal is to compare the accuracy of Llama-2 and our original GPT-3 baseline by using a similar metric. Llama-2, a simpler model, shows competitive or even superior accuracy. We also report on the differences between these foundational large language models as generative AI continues to redefine human-computer interactions. Overall, Copilot generates code that is more reliable but less optimized, whereas code generated by Llama-2 is less reliable but more optimized when correct.
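For readers unfamiliar with the kernels named above, the following is a minimal pure-Python sketch of the reference semantics of AXPY, GEMV, and GEMM. It is illustrative only: it is not code from the study, not output from either model, and not parallelized. The function names and signatures are hypothetical choices made for this sketch.

```python
# Illustrative reference semantics only; names and signatures are
# assumptions for this sketch, not taken from the study.

def axpy(alpha, x, y):
    """Return alpha * x + y, elementwise, for equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(alpha, A, x, beta, y):
    """Return alpha * (A @ x) + beta * y for a row-major matrix A."""
    return [
        alpha * sum(a * xj for a, xj in zip(row, x)) + beta * yi
        for row, yi in zip(A, y)
    ]


def gemm(alpha, A, B, beta, C):
    """Return alpha * (A @ B) + beta * C for row-major matrices."""
    cols_B = list(zip(*B))  # columns of B, so each dot product is row . column
    return [
        [
            alpha * sum(a * b for a, b in zip(row, col)) + beta * c
            for col, c in zip(cols_B, c_row)
        ]
        for row, c_row in zip(A, C)
    ]
```

A generated kernel in any of the programming models listed above could, in principle, be checked against such a serial reference on small inputs; whether the study did so is not stated here.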