Sunita Chandrasekaran

AI
h-index45
5papers
59citations
Novelty32%
AI Score42

5 Papers

AIOct 8, 2023Code
LLM4VV: Developing LLM-Driven Testsuite for Compiler Validation

Christian Munley, Aaron Jarmusch, Sunita Chandrasekaran

Large language models (LLMs) are a new and powerful tool for a wide span of applications involving natural language and demonstrate impressive code generation abilities. The goal of this work is to automatically generate tests and use these tests to validate and verify compiler implementations of a directive-based parallel programming paradigm, OpenACC. To do so, in this paper, we explore the capabilities of state-of-the-art LLMs, including open-source LLMs -- Meta Codellama, Phind fine-tuned version of Codellama, Deepseek Deepseek Coder and closed-source LLMs -- OpenAI GPT-3.5-Turbo and GPT-4-Turbo. We further fine-tuned the open-source LLMs and GPT-3.5-Turbo using our own testsuite dataset along with using the OpenACC specification. We also explored these LLMs using various prompt engineering techniques that include code template, template with retrieval-augmented generation (RAG), one-shot example, one-shot with RAG, expressive prompt with code template and RAG. This paper highlights our findings from over 5000 tests generated via all the above mentioned methods. Our contributions include: (a) exploring the capabilities of the latest and relevant LLMs for code generation, (b) investigating fine-tuning and prompt methods, and (c) analyzing the outcome of LLMs generated tests including manually analysis of representative set of tests. We found the LLM Deepseek-Coder-33b-Instruct produced the most passing tests followed by GPT-4-Turbo.

37.3DCMay 5Code
Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures

Aaron Jarmusch, Sunita Chandrasekaran

Rapidly evolving GPU architectures featuring complex memory hierarchies, matrix units, and varied precision formats continue to widen the gap between theoretical peaks and achievable performance. We design and develop analytical performance models for NVIDIA Blackwell (B200) and AMD CDNA3 (MI300A) grounded in systematic microbenchmark characterization. For Blackwell, the model captures Tensor Memory (TMEM), asynchronous bulk copy (TMA), and 5th-generation tensor cores; for CDNA3, the model captures Infinity Cache hierarchy, VGPR constraints, and occupancy. Validation yields 1.31% MAE on B200 (21 kernels) and 0.09% on MI300A (27 kernels), while naive roofline baselines exceed 95% error on the same kernels. We further validate the models using Rodinia~3.1 and SPEChpc 2021 Tiny.The models are updated with HBM bandwidth, capacity, and cache parameters and applied to H200 (Hopper) and MI250X (CDNA2), indicating no major restructuring of the models are needed. All models and benchmarks will be released as open-source upon acceptance.

7.1CLMay 15
Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches

Daria Blinova, Gayathri Emuru, Rakesh Emuru et al.

This paper introduces a dataset of interlinked multimodal political communications from the Russian government, addressing persistent deficiencies in the availability of social text- and image-based data for authoritarian politics contexts. The dataset comprises two large corpora of official speeches delivered by senior actors within the Kremlin and the Russian Ministry of Foreign Affairs over multiple decades. For each speech, we provide Russian- and English-language texts, associated images and captions where available, and harmonized metadata including (e.g.) dates, speakers, (geo)locations, and official government content tags. Unique identifiers link images to speeches and align Russian and English versions of the same communication texts. We further augment these linked datasets with validated topical annotations for both speech texts and speech images, which are generated via transformer-based multimodal topic modeling and refined by a Russian politics expert. The resulting data resources support multimodal, multilingual, temporal, and/or spatial analyses of (authoritarian) political communication and offer a valuable testbed for social science research and large language model (LLM) applications in political domains.

COMP-PHJan 6, 2025
The Artificial Scientist -- in-transit Machine Learning of Plasma Simulations

Jeffrey Kelling, Vicente Bolea, Michael Bussmann et al.

Increasing HPC cluster sizes and large-scale simulations that produce petabytes of data per run, create massive IO and storage challenges for analysis. Deep learning-based techniques, in particular, make use of these amounts of domain data to extract patterns that help build scientific understanding. Here, we demonstrate a streaming workflow in which simulation data is streamed directly to a machine-learning (ML) framework, circumventing the file system bottleneck. Data is transformed in transit, asynchronously to the simulation and the training of the model. With the presented workflow, data operations can be performed in common and easy-to-use programming languages, freeing the application user from adapting the application output routines. As a proof-of-concept we consider a GPU accelerated particle-in-cell (PIConGPU) simulation of the Kelvin- Helmholtz instability (KHI). We employ experience replay to avoid catastrophic forgetting in learning from this non-steady process in a continual manner. We detail challenges addressed while porting and scaling to Frontier exascale system.

PLOct 16, 2021
Challenges Porting a C++ Template-Metaprogramming Abstraction Layer to Directive-based Offloading

Jeffrey Kelling, Sergei Bastrakov, Alexander Debus et al.

HPC systems employ a growing variety of compute accelerators with different architectures and from different vendors. Large scientific applications are required to run efficiently across these systems but need to retain a single code-base in order to not stifle development. Directive-based offloading programming models set out to provide the required portability, but, to existing codes, they themselves represent yet another API to port to. Here, we present our approach of porting the GPU-accelerated particle-in-cell code PIConGPU to OpenACC and OpenMP target by adding two new backends to its existing C++-template metaprogramming-based offloading abstraction layer alpaka and avoiding other modifications to the application code. We introduce our approach in the face of conflicts between requirements and available features in the standards as well as practical hurdles posed by immature compiler support.