DC AI LG PL | Jan 27

Axe: A Simple Unified Layout Abstraction for Machine Learning Compilers

arXiv:2601.19092v1 | 2 citations | h-index: 33
Originality: Highly original
AI Analysis

This paper addresses the problem of efficient hardware-aware compilation for machine learning practitioners working with heterogeneous accelerators, and it presents a novel method rather than an incremental improvement.

The paper tackles the challenge of scaling deep learning workloads across complex hardware configurations by introducing Axe Layout, a unified abstraction that maps logical tensor coordinates to physical space via named axes, enabling consistent expression of collective primitives from device meshes to threads. Experiments show this approach achieves performance close to hand-tuned kernels on various GPU devices, multi-device environments, and accelerator backends.

Scaling modern deep learning workloads demands coordinated placement of data and compute across device meshes, memory hierarchies, and heterogeneous accelerators. We present Axe Layout, a hardware-aware abstraction that maps logical tensor coordinates to a multi-axis physical space via named axes. Axe unifies tiling, sharding, replication, and offsets across inter-device distribution and on-device layouts, enabling collective primitives to be expressed consistently from device meshes to threads. Building on Axe, we design a multi-granularity, distribution-aware DSL and compiler that composes thread-local control with collective operators in a single kernel. Experiments show that our unified approach can bring performance close to that of hand-tuned kernels across the latest GPU devices, multi-device environments, and accelerator backends.
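
To make the idea of mapping a logical coordinate onto named physical axes more concrete, here is a minimal toy sketch in Python. It is not the paper's definition or API: the function name `map_index`, the `(extent, stride, axis)` tuple format, and the axis names `"gpu"` and `"m"` are all hypothetical, chosen only for illustration. It covers sharding, tiling, and offsets, and deliberately leaves out replication.

```python
from collections import defaultdict

def map_index(index, iters, offset=None):
    """Map a flat logical index to coordinates on named physical axes.

    iters  : list of (extent, stride, axis) tuples, outermost first
             (a hypothetical format, not the paper's notation).
    offset : optional dict of per-axis base offsets.
    """
    coords = defaultdict(int)
    # Start from the base offset on each named axis, if any.
    for axis, base in (offset or {}).items():
        coords[axis] += base
    # Peel off mixed-radix digits, innermost iterator first.
    for extent, stride, axis in reversed(iters):
        index, digit = divmod(index, extent)
        coords[axis] += digit * stride
    if index != 0:
        raise IndexError("logical index exceeds the layout's extent")
    return dict(coords)

# Assumed example: 8 logical elements, split across 2 devices ("gpu"),
# with 4 contiguous elements in each device's memory ("m").
iters = [(2, 1, "gpu"), (4, 1, "m")]
print(map_index(5, iters))
print(map_index(5, iters, offset={"m": 16}))
```

In this sketch, logical element 5 lands on device 1 at memory position 1, and the optional offset shifts the memory position by a fixed base. Because every iterator carries an axis name, the same decomposition can describe placement across devices and within a device's memory, which is the kind of unification the abstract describes.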

