91.9PFMay 6
KEET: Explaining Performance of GPU Kernels Using LLM AgentsJoshua H. Davis, Klaudiusz Rydzy, Srinivasan Ramesh et al.
Performance profiles of GPU kernels generated by tools such as Nsight Compute are rich in detail but are often challenging to interpret. To achieve the best performance possible on a given GPU architecture, kernel developers need to spend significant time analyzing and comparing profiles in the tool's graphical interface to identify and understand kernel performance bottlenecks. Large Language Models (LLMs) have shown promise in understanding complex data and generating natural language explanations. In this paper, we propose the Kernel Execution Explanation Toolkit (KEET), an LLM-based agentic framework for interpreting Nsight Compute profiles to generate useful and data-grounded natural language explanations of performance issues in GPU kernels, and suggestions for optimizations. We evaluate \toolname using several CUDA kernels of varying complexity on NVIDIA H100 GPUs. We find that the generated explanations, when provided as context, improve the quality of LLM code optimization and multiple-choice question answering in downstream tasks. We further demonstrate that the tool can be used to interpret performance data from large sets of profiles to improve the quality of optimization suggestions.
8.2CVMar 15
A Heterogeneous Ensemble for Multi-Center COVID-19 Classification from Chest CT ScansAadit Nilay, Bhavesh Thapar, Anant Agrawal et al.
The COVID-19 pandemic exposed critical limitations in diagnostic workflows: RT-PCR tests suffer from slow turnaround times and high false-negative rates, while CT-based screening offers faster complementary diagnosis but requires expert radiological interpretation. Deploying automated CT analysis across multiple hospital centres introduces further challenges, as differences in scanner hardware, acquisition protocols, and patient populations cause substantial domain shift that degrades single-model performance. To address these challenges, we present a heterogeneous ensemble of nine models spanning three inference paradigms: (1) a self-supervised DINOv2 Vision Transformer with slice-level sigmoid aggregation, (2) a RadImageNet-pretrained DenseNet-121 with slice-level sigmoid averaging, and (3) seven Gated Attention Multiple Instance Learning models using EfficientNet-B3, ConvNeXt-Tiny, and EfficientNetV2-S backbones with scan-level softmax classification. Ensemble diversity is further enhanced through random-seed variation and Stochastic Weight Averaging. We address severe overfitting, reducing the validation-to-training loss ratio from 35x to less than 3x, through a combination of Focal Loss, embedding-level Mixup, and domain-aware augmentation. Model outputs are fused via score-weighted probability averaging and calibrated with per-source threshold optimization. The final ensemble achieves an average macro F1 of 0.9280 across four hospital centres, outperforming the best single model (F1=0.8969) by +0.031, demonstrating that heterogeneous architectures combined with source-aware calibration are essential for robust multi-site medical image classification.