JZ-Tree: GPU friendly neighbour search and friends-of-friends with dual tree walks in JAX plus CUDA

arXiv:2604.058857.31 citationsHas Code
Predicted impact top 89% in DC · last 90 daysOriginality Incremental advance
AI Analysis

This work addresses performance bottlenecks in GPU-based high-performance computing for spatial algorithms, offering a foundational solution that is incremental in optimizing existing tree methods for GPU architectures.

The paper tackled the challenge of efficiently implementing spatial tree traversal algorithms on GPUs, which often suffer from thread divergence and irregular memory access, by proposing a Morton plane-based tree hierarchy designed for GPU architectures. The result was more than an order-of-magnitude performance improvement over competing GPU libraries for large-scale problems, such as k-nearest neighbor search and friends-of-friends clustering, with strong scaling to multi-GPU systems.

Algorithms based on spatial tree traversal are widely regarded as among the most efficient and flexible approaches for many problems in CPU-based high-performance computing (HPC). However, directly transferring these algorithms to GPU architectures often yields substantially smaller performance gains than expected in light of the high computational throughput of modern GPUs. The branching nature of tree algorithms leads to thread divergence and irregular memory access patterns -- both of which may severely limit GPU performance. To address these challenges, we propose a Morton (z-order) 'plane-based tree hierarchy' that is specifically designed for GPU architectures. The resulting flattened data layout enables efficient dual-tree traversal with collaborative execution across thread groups, leading to highly coalesced memory access patterns. Based on this framework we present implementations of two important spatial algorithms -- exact $k$-nearest neighbour search and friends-of-friends (FoF) clustering. For both cases, we observe more than an order-of-magnitude performance improvement over the closest competing GPU libraries for large problem sizes ($N \gtrsim 10^7$), together with strong scaling to distributed multi-GPU systems. We provide an open-source implementation, 'JZ-Tree' (JAX z-order tree), which serves as a foundation for efficient GPU implementations of a broad class of tree-based algorithms.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes