PLApr 15, 2024Code
Towards a high-performance AI compiler with upstream MLIRRenato Golin, Lorenzo Chelini, Adam Siemieniuk et al.
This work proposes a compilation flow using open-source compiler passes to build a framework to achieve ninja performance from a generic linear algebra high-level abstraction. We demonstrate this flow with a proof-of-concept MLIR project that uses input IR in Linalg-on-Tensor from TensorFlow and PyTorch, performs cache-level optimizations and lowering to micro-kernels for efficient vectorization, achieving over 90% of the performance of ninja-written equivalent programs. The contributions of this work include: (1) Packing primitives on the tensor dialect and passes for cache-aware distribution of tensors (single and multi-core) and type-aware instructions (VNNI, BFDOT, BFMMLA), including propagation of shapes across the entire function; (2) A linear algebra pipeline, including tile, fuse and bufferization strategies to get model-level IR into hardware friendly tile calls; (3) A mechanism for micro-kernel lowering to an open source library that supports various CPUs.
LGNov 14, 2025
Library Liberation: Competitive Performance Matmul Through Compiler-composed NanokernelsArun Thangamani, Md Asghar Ahmad Shahid, Adam Siemieniuk et al.
The rapidly evolving landscape of AI and machine learning workloads has widened the gap between high-level domain operations and efficient hardware utilization. Achieving near-peak performance still demands deep hardware expertise-experts either handcraft target-specific kernels (e.g., DeepSeek) or rely on specialized libraries (e.g., CUTLASS)-both of which add complexity and limit scalability for most ML practitioners. This paper introduces a compilation scheme that automatically generates scalable, high-performance microkernels by leveraging the MLIR dialects to bridge domain-level operations and processor capabilities. Our approach removes dependence on low-level libraries by enabling the compiler to auto-generate near-optimal code directly. At its core is a mechanism for composing nanokernels from low-level IR constructs with near-optimal register utilization, forming efficient microkernels tailored to each target. We implement this technique in an MLIR-based compiler supporting both vector and tile based CPU instructions. Experiments show that the generated nanokernels are of production-quality, and competitive with state-of-the-art microkernel libraries.
96.0DCApr 16
Xe-Forge: Multi-Stage LLM-Powered Kernel Optimization for Intel GPUMarcin Spoczynski, Daniel Fleischer, Moshe Berchansky et al.
Porting deep learning algorithms to new hardware accelerators requires developers to repeatedly apply the same low-level optimizations -- quantization, memory access coalescing, tile size tuning, and architecture-specific workarounds -- to every Triton kernel in their code-base. This manual, repetitive effort is a major bottleneck: each kernel demands the same cycle of trial-and-error profiling against hardware constraints that vary across devices, yet the underlying optimization patterns remain largely consistent. We present Xe-Forge, a multi-stage LLM-powered pipeline that automates this process for Intel GPU. Given a functionally correct Triton kernel, the system applies up to nine optimization stages -- from algorithmic restructuring and operator fusion through block pointer modernization, GPU-specific tuning, and open-ended discovery -- each driven by a Chain-of-Verification-and-Refinement (CoVeR) agent that generates candidates, validates them on real hardware, and iterates on failures. A curated knowledge base encodes Intel GPU constraints (power-of-two warp counts, GRF modes, SLM sizing) that are absent from LLM training data, keeping the model within architecturally valid bounds. We evaluate Xe-Forge on 97 Level-2 KernelBench kernels and Flash Attention on the Intel Arc Pro B70, achieving a 1.17x geometric mean speedup over PyTorch eager with 67% of kernels improving, nine kernels exceeding 5x (up to 82x), and 2--13.3x speedups on Flash Attention across all tested configurations without regression -- demonstrating that structured domain knowledge with hardware-in-the-loop verification can systematically eliminate the repetitive porting effort that currently gates algorithm deployment on new accelerators.