DC · May 12

Efficient and Portable Support for Overdecomposition on Distributed Memory GPGPU Platforms

Aditya Bhosale, Anant Jain, Shourya Goel, Ritvik Rao, Peddoju Sateesh Kumar, Laxmikant Kale

arXiv:2605.12734 · 31.5

Predicted impact: top 51% in DC · last 90 days · Originality: Incremental advance

AI Analysis

For developers of parallel applications using overdecomposition (e.g., adaptive mesh refinement, tree codes), this work provides a portable and efficient solution to a known performance bottleneck on GPGPU clusters.

The paper addresses the challenge of supporting overdecomposition on distributed memory GPGPU platforms, demonstrating that it can be efficiently and portably implemented across different GPU vendors and interconnects, thereby enabling productive use of this parallel programming paradigm.

Overdecomposition has emerged as a powerful and sometimes essential technique in parallel programming. Many application domains and frameworks, including those based on adaptive mesh refinement or tree codes, use it. Charm++ is a parallel programming system that has demonstrated the utility of overdecomposition for many applications and in multiple contexts. However, the emergence of GPGPUs as a dominant compute component has created some real and perceived challenges for this paradigm, especially regarding the higher overhead brought about by overpartitioning, that is, having multiple objects assigned to the same GPGPU device. We address this issue, as well as the issue of portability, by developing techniques and software that demonstrate that overdecomposition can be efficiently and productively supported across combinations of GPU vendors and interconnection networks.
