DC · PF · Mar 19

High-Performance Portable GPU Primitives for Arbitrary Types and Operators in Julia

arXiv:2603.18695 · 20.9 · h-index: 3
Predicted impact: top 67% in DC · last 90 days · Originality: Incremental advance
AI Analysis

This work addresses performance portability for GPU developers. It offers a proof of concept that JIT-compiled abstractions can achieve vendor-level throughput without sacrificing generality, though it is an incremental advance over existing portable frameworks.

The paper tackles the overhead of portable GPU frameworks by introducing KernelForge.jl, a Julia library that implements scan, mapreduce, and matrix-vector primitives. Evaluated on an NVIDIA A40 and an AMD MI300X, it matches or exceeds CUB kernel execution time on scan and mapreduce on the A40 and matches cuBLAS throughput on matrix-vector operations across most tested configurations.

Portable GPU frameworks such as Kokkos and RAJA reduce the burden of cross-architecture development but typically incur measurable overhead on fundamental parallel primitives relative to vendor-optimized libraries. We present KernelForge.jl, a Julia library that implements scan, mapreduce, and matrix-vector primitives through a two-layer portable architecture: KernelIntrinsics.jl provides backend-agnostic abstractions for warp-level shuffles, memory fences, and vectorized memory access, while KernelForge.jl builds high-performance algorithms exclusively on top of these interfaces. Evaluated on an NVIDIA A40 and an AMD MI300X, KernelForge.jl matches or exceeds CUB kernel execution time on scan and mapreduce on the A40, and matches cuBLAS throughput on matrix-vector operations across most tested configurations, demonstrating, as a proof of concept, that portable JIT-compiled abstractions can achieve vendor-level throughput without sacrificing generality.
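To make "scan and mapreduce for arbitrary types and operators" concrete, here is a minimal CPU-side sketch in Python. It is not KernelForge.jl's API or algorithm: the names `inclusive_scan`, `blocked_scan`, `map_reduce`, and `compose` are hypothetical and introduced only for illustration. The blocked variant is meant to suggest how a scan can be split into independently processed chunks, as a GPU implementation might do, while relying only on the operator being associative.

```python
from functools import reduce
from typing import Callable, Iterable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def inclusive_scan(op: Callable[[T, T], T], xs: Sequence[T]) -> List[T]:
    """Sequential reference: out[i] = xs[0] op xs[1] op ... op xs[i]."""
    out: List[T] = []
    for x in xs:
        out.append(x if not out else op(out[-1], x))
    return out


def blocked_scan(op: Callable[[T, T], T], xs: Sequence[T], block: int = 4) -> List[T]:
    """Scan each block on its own, then fold in the total of all earlier blocks.

    Only associativity of `op` is assumed. Operand order is preserved, so
    non-commutative operators are handled correctly.
    """
    out: List[T] = []
    have_carry = False
    carry = None  # combined value of every element in the preceding blocks
    for start in range(0, len(xs), block):
        local = inclusive_scan(op, xs[start:start + block])
        if have_carry:
            local = [op(carry, v) for v in local]  # carry goes on the left
        out.extend(local)
        carry, have_carry = local[-1], True
    return out


def map_reduce(f: Callable[[T], U], op: Callable[[U, U], U],
               xs: Iterable[T], init: U) -> U:
    """Apply `f` to every element, then combine the results with `op`."""
    return reduce(op, map(f, xs), init)


# A non-trivial element type: the pair (a, b) stands for the map x -> a*x + b.
Affine = Tuple[int, int]


def compose(p: Affine, q: Affine) -> Affine:
    """Apply p first, then q. Associative but not commutative."""
    (a1, b1), (a2, b2) = p, q
    return (a1 * a2, a2 * b1 + b2)


maps: List[Affine] = [(2, 1), (3, 0), (1, 5), (2, 2), (1, 1)]
reference = inclusive_scan(compose, maps)
chunked = blocked_scan(compose, maps, block=2)
sum_of_squares = map_reduce(lambda x: x * x, lambda a, b: a + b, [1, 2, 3], 0)
```

Here `chunked` equals `reference` even though `compose` is not commutative, which is the property that lets a scan be parallelized over user-defined types. A primitive hard-wired to addition on floats could not express this case.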

