DC AI LG PL Jan 27

Axe: A Simple Unified Layout Abstraction for Machine Learning Compilers

Bohan Hou, Hongyi Jin, Guanjie Wang, Jinqi Chen, Yaxing Cai, Lijie Yang, Zihao Ye, Yaoyao Ding, Ruihang Lai, Tianqi Chen

arXiv:2601.19092v13.32 citations h-index: 33

Originality Highly original

AI Analysis

This work addresses efficient hardware-aware compilation for machine learning practitioners working with heterogeneous accelerators. It presents a novel method rather than an incremental improvement.

The paper tackles the challenge of scaling deep learning workloads across complex hardware configurations by introducing Axe Layout, a unified abstraction that maps logical tensor coordinates to physical space via named axes, enabling consistent expression of collective primitives from device meshes to threads. Experiments show this approach achieves performance close to hand-tuned kernels on various GPU devices, multi-device environments, and accelerator backends.

Scaling modern deep learning workloads demands coordinated placement of data and compute across device meshes, memory hierarchies, and heterogeneous accelerators. We present Axe Layout, a hardware-aware abstraction that maps logical tensor coordinates to a multi-axis physical space via named axes. Axe unifies tiling, sharding, replication, and offsets across inter-device distribution and on-device layouts, enabling collective primitives to be expressed consistently from device meshes to threads. Building on Axe, we design a multi-granularity, distribution-aware DSL and compiler that composes thread-local control with collective operators in a single kernel. Experiments show that our unified approach can bring performance close to hand-tuned kernels across the latest GPU devices, multi-device environments, and accelerator backends.
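
To make the idea of mapping logical tensor coordinates to named physical axes concrete, here is a minimal Python sketch. It is not Axe's actual API or formalism. The function `map_coord`, the `(extent, stride, axis_name)` triple representation, and the axis names `"device"` and `"mem"` are all assumptions made for illustration. The sketch covers only tiling, sharding, and constant offsets, and it leaves replication out.

```python
from collections import defaultdict


def map_coord(shape, iters, coord, offsets=None):
    """Map a logical coordinate to per-axis physical offsets.

    Hypothetical illustration, not the Axe API.
    shape   : logical tensor shape, e.g. (4, 8)
    iters   : list of (extent, stride, axis_name), outermost first;
              the product of the extents must equal the number of elements
    coord   : logical coordinate, one index per dimension of `shape`
    offsets : optional {axis_name: constant offset}
    """
    # Flatten the logical coordinate in row-major order.
    flat = 0
    for dim, c in zip(shape, coord):
        if not 0 <= c < dim:
            raise IndexError(f"coordinate {coord} out of bounds for {shape}")
        flat = flat * dim + c

    # Split the flat index into one digit per iter (mixed radix),
    # peeling off the innermost iter first.
    digits = []
    for extent, _, _ in reversed(iters):
        digits.append(flat % extent)
        flat //= extent
    digits.reverse()

    # Accumulate digit * stride on each named physical axis.
    phys = defaultdict(int)
    for axis, off in (offsets or {}).items():
        phys[axis] += off
    for digit, (_, stride, axis) in zip(digits, iters):
        phys[axis] += digit * stride
    return dict(phys)


# A 4x8 tensor whose rows are sharded across 2 devices; each device
# stores its 2x8 shard row-major in local memory.
layout = [(2, 1, "device"), (2, 8, "mem"), (8, 1, "mem")]
print(map_coord((4, 8), layout, (3, 5)))
```

In this sketch, sharding and on-device tiling differ only in the axis name attached to an iter, which is one way to read the abstract's claim that a single abstraction can span device meshes down to memory. How Axe itself represents replication and thread-level axes is not specified in the text above.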
