Aaron Jarmusch

2 Papers

AI | Oct 8, 2023 | Code
LLM4VV: Developing LLM-Driven Testsuite for Compiler Validation

Christian Munley, Aaron Jarmusch, Sunita Chandrasekaran

Large language models (LLMs) are a new and powerful tool for a wide span of applications involving natural language and demonstrate impressive code generation abilities. The goal of this work is to automatically generate tests and use them to validate and verify compiler implementations of OpenACC, a directive-based parallel programming paradigm. To do so, we explore the capabilities of state-of-the-art LLMs, including the open-source LLMs Meta Codellama, Phind's fine-tuned version of Codellama, and Deepseek's Deepseek Coder, and the closed-source LLMs OpenAI GPT-3.5-Turbo and GPT-4-Turbo. We further fine-tuned the open-source LLMs and GPT-3.5-Turbo using our own testsuite dataset along with the OpenACC specification. We also explored these LLMs using various prompt engineering techniques: a code template, a template with retrieval-augmented generation (RAG), a one-shot example, one-shot with RAG, and an expressive prompt with a code template and RAG. This paper highlights our findings from over 5000 tests generated via all of the above-mentioned methods. Our contributions include: (a) exploring the capabilities of the latest and relevant LLMs for code generation, (b) investigating fine-tuning and prompt methods, and (c) analyzing the outcome of LLM-generated tests, including manual analysis of a representative set of tests. We found that the LLM Deepseek-Coder-33b-Instruct produced the most passing tests, followed by GPT-4-Turbo.
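The prompt variants named in the abstract (code template, one-shot example, each with or without RAG) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the prompt wording, the placeholder specification snippets, and the keyword-overlap retrieval, which stands in for whatever retriever the authors actually used.

```python
import re

# Hypothetical stand-ins for specification excerpts. A real pipeline would
# index the actual OpenACC specification text instead of these placeholders.
SPEC_SNIPPETS = [
    "parallel construct: placeholder excerpt describing the parallel construct",
    "data construct: placeholder excerpt describing the data construct and copyin clause",
    "loop construct: placeholder excerpt describing the loop construct and reduction clause",
]

# Hypothetical skeleton that the model is asked to fill in.
CODE_TEMPLATE = """#include <stdlib.h>
int test(void) {
    int err = 0;
    /* TODO: exercise the feature and set err on failure */
    return err;
}
int main(void) { return test(); }"""


def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(query, snippets, k=1):
    """Return the k snippets sharing the most tokens with the query."""
    q = tokenize(query)
    ranked = sorted(snippets, key=lambda s: len(q & tokenize(s)), reverse=True)
    return ranked[:k]


def build_prompt(feature, use_template=True, one_shot=None, use_rag=False):
    """Assemble a prompt from the optional pieces named in the abstract."""
    parts = [
        "Write a C test that validates a compiler's implementation "
        f"of the OpenACC {feature}."
    ]
    if use_rag:
        retrieved = retrieve(feature, SPEC_SNIPPETS)
        parts.append("Relevant specification text:\n" + "\n".join(retrieved))
    if one_shot is not None:
        parts.append("Example test:\n" + one_shot)
    if use_template:
        parts.append("Follow this template:\n" + CODE_TEMPLATE)
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("loop construct reduction clause", use_rag=True))
```

Toggling `use_template`, `one_shot`, and `use_rag` yields the template-only, one-shot, and RAG-augmented variants from a single function, which makes it easy to compare them on the same feature.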

25.6 | DC | May 5 | Code
Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures

Aaron Jarmusch, Sunita Chandrasekaran

Rapidly evolving GPU architectures featuring complex memory hierarchies, matrix units, and varied precision formats continue to widen the gap between theoretical peaks and achievable performance. We design and develop analytical performance models for NVIDIA Blackwell (B200) and AMD CDNA3 (MI300A) grounded in systematic microbenchmark characterization. For Blackwell, the model captures Tensor Memory (TMEM), asynchronous bulk copy (TMA), and 5th-generation tensor cores; for CDNA3, the model captures the Infinity Cache hierarchy, VGPR constraints, and occupancy. Validation yields 1.31% MAE on B200 (21 kernels) and 0.09% MAE on MI300A (27 kernels), while naive roofline baselines exceed 95% error on the same kernels. We further validate the models using Rodinia 3.1 and SPEChpc 2021 Tiny. The models are updated with HBM bandwidth, capacity, and cache parameters and applied to H200 (Hopper) and MI250X (CDNA2), indicating that no major restructuring of the models is needed. All models and benchmarks will be released as open source upon acceptance.
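For readers unfamiliar with the baseline, the sketch below shows the standard naive roofline bound and one plausible reading of the error metric. It assumes that "MAE" reported as a percentage means the mean absolute relative error between predicted and measured values; the abstract does not define it. The function names and the numbers in the usage lines are hypothetical and are not hardware specifications or results from the paper.

```python
def roofline_flops(arithmetic_intensity, peak_flops, peak_bandwidth):
    """Naive roofline bound on attainable throughput.

    arithmetic_intensity: FLOPs per byte moved to or from memory
    peak_flops:           peak compute rate (FLOP/s)
    peak_bandwidth:       peak memory bandwidth (bytes/s)
    """
    return min(peak_flops, arithmetic_intensity * peak_bandwidth)


def mean_abs_pct_error(predicted, measured):
    """Mean absolute relative error, in percent (assumed meaning of 'MAE')."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - m) / m for p, m in zip(predicted, measured))
    return 100.0 * total / len(measured)


if __name__ == "__main__":
    # Hypothetical machine: 100 FLOP/s peak compute, 10 bytes/s peak bandwidth.
    print(roofline_flops(2.0, 100.0, 10.0))    # memory-bound kernel
    print(roofline_flops(50.0, 100.0, 10.0))   # compute-bound kernel
    print(mean_abs_pct_error([110.0, 90.0], [100.0, 100.0]))
```

The roofline bound uses only two machine parameters, which is why it can miss effects such as occupancy limits or cache behavior that a microbenchmark-driven model is designed to capture.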