Towards a high-performance AI compiler with upstream MLIR
This work addresses performance bottlenecks in AI compilation for developers using frameworks such as TensorFlow and PyTorch. It appears incremental, since it builds on existing MLIR infrastructure.
The authors address the problem of reaching high performance when compiling from generic linear algebra abstractions. They develop an MLIR-based compilation flow with cache-level optimizations and micro-kernel lowering that achieves over 90% of the performance of equivalent programs hand-optimized by expert ("ninja") programmers.
This work proposes a compilation flow that uses open-source compiler passes to build a framework achieving ninja performance from a generic, high-level linear algebra abstraction. We demonstrate this flow with a proof-of-concept MLIR project that takes input IR in Linalg-on-Tensor from TensorFlow and PyTorch, applies cache-level optimizations, and lowers to micro-kernels for efficient vectorization, achieving over 90% of the performance of ninja-written equivalent programs. The contributions of this work include: (1) packing primitives on the tensor dialect and passes for cache-aware distribution of tensors (single and multi-core) and type-aware instructions (VNNI, BFDOT, BFMMLA), including propagation of shapes across the entire function; (2) a linear algebra pipeline, including tile, fuse and bufferization strategies, to get model-level IR into hardware-friendly tile calls; (3) a mechanism for micro-kernel lowering to an open-source library that supports various CPUs.
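To make the packing and tiling ideas concrete, the following is a minimal Python sketch, not the authors' implementation. It repacks row-major matrices into a blocked layout and computes a matrix product as a loop nest over tiles, where each step calls a small kernel on one tile triple. The names `pack`, `unpack`, `micro_kernel`, and `tiled_matmul`, and the tile sizes `bm`, `bn`, `bk`, are hypothetical and chosen for illustration; the sketch assumes matrix dimensions divide evenly by the tile sizes.

```python
def pack(matrix, bm, bn):
    """Repack a row-major 2D list into tiles.

    result[i][j] is the (bm x bn) tile at block row i, block column j.
    Assumption for this sketch: dimensions divide evenly by the tile sizes.
    """
    rows, cols = len(matrix), len(matrix[0])
    assert rows % bm == 0 and cols % bn == 0, "sketch assumes exact tiling"
    return [
        [
            [matrix[bi * bm + r][bj * bn:(bj + 1) * bn] for r in range(bm)]
            for bj in range(cols // bn)
        ]
        for bi in range(rows // bm)
    ]


def unpack(packed):
    """Inverse of pack: rebuild the row-major 2D list from its tiles."""
    out = []
    for block_row in packed:
        tile_rows = len(block_row[0])
        for r in range(tile_rows):
            row = []
            for tile in block_row:
                row.extend(tile[r])
            out.append(row)
    return out


def micro_kernel(a_tile, b_tile, c_tile):
    """Hypothetical stand-in for a library micro-kernel: c_tile += a_tile @ b_tile."""
    for i, a_row in enumerate(a_tile):
        c_row = c_tile[i]
        for k, a_val in enumerate(a_row):
            for j, b_val in enumerate(b_tile[k]):
                c_row[j] += a_val * b_val


def tiled_matmul(a, b, bm, bn, bk):
    """Multiply a (m x k) by b (k x n) as a loop nest of micro-kernel calls."""
    m, k, n = len(a), len(b), len(b[0])
    assert len(a[0]) == k, "inner dimensions must match"
    a_packed = pack(a, bm, bk)
    b_packed = pack(b, bk, bn)
    # One zero-initialized (bm x bn) accumulator tile per output block.
    c_packed = [
        [[[0] * bn for _ in range(bm)] for _ in range(n // bn)]
        for _ in range(m // bm)
    ]
    for bi in range(m // bm):
        for bj in range(n // bn):
            for bp in range(k // bk):  # reduction over tiles of the k dimension
                micro_kernel(a_packed[bi][bp], b_packed[bp][bj], c_packed[bi][bj])
    return unpack(c_packed)
```

In this sketch each tile is stored contiguously, so a kernel call touches only the data of its three tiles; a compiler flow of the kind described would pick tile sizes to fit the target's cache levels and replace `micro_kernel` with a vectorized library routine.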