PF LG Aug 29, 2019

TapirXLA: Embedding Fork-Join Parallelism into the XLA Compiler in TensorFlow Using Tapir

arXiv:1908.11338v1 1.25 citations

Originality: Incremental advance

AI Analysis

This work targets performance lost by compilers in machine-learning frameworks, which matters to developers and researchers, but it is an incremental advance because it builds on the existing Tapir and XLA technologies.

The authors address the problem that compilers in machine-learning frameworks lack a deep understanding of parallelism, which causes them to miss optimizations on parallel computation and lose performance. They introduce TapirXLA, a replacement for TensorFlow's XLA compiler that embeds recursive fork-join parallelism into XLA's low-level representation of code. On four neural-network benchmarks across four CPU architectures, the reported geometric-mean speedup in parallel running time ranges from 30% to 100% over an unmodified XLA compiler.

This work introduces TapirXLA, a replacement for TensorFlow's XLA compiler that embeds recursive fork-join parallelism into XLA's low-level representation of code. Machine-learning applications rely on efficient parallel processing to achieve performance, and they employ a variety of technologies to improve performance, including compiler technology. But compilers in machine-learning frameworks lack a deep understanding of parallelism, causing them to lose performance by missing optimizations on parallel computation. This work studies how Tapir, a compiler intermediate representation (IR) that embeds parallelism into a mainstream compiler IR, can be incorporated into a compiler for machine learning to remedy this problem. TapirXLA modifies the XLA compiler in TensorFlow to employ the Tapir/LLVM compiler to optimize low-level parallel computation. TapirXLA encodes the parallelism within high-level TensorFlow operations using Tapir's representation of fork-join parallelism. TapirXLA also exposes to the compiler implementations of linear-algebra library routines whose parallel operations are encoded using Tapir's representation. We compared the performance of TensorFlow using TapirXLA against TensorFlow using an unmodified XLA compiler. On four neural-network benchmarks, TapirXLA speeds up the parallel running time of the network by a geometric-mean multiplicative factor of 30% to 100%, across four CPU architectures.
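For readers unfamiliar with the term, the recursive fork-join pattern mentioned in the abstract can be sketched in a few lines of Python. This is only an illustration of the control structure (fork a child task, continue in the parent, then join), not Tapir's IR and not code from TapirXLA. The function name `fork_join_sum` and the `GRAIN_SIZE` cutoff are hypothetical, and in standard CPython the global interpreter lock means this sketch shows the shape of the computation rather than a real speedup.

```python
import threading

GRAIN_SIZE = 1024  # hypothetical cutoff below which work runs serially


def fork_join_sum(values, lo, hi):
    """Sum values[lo:hi] using recursive fork-join (divide and conquer)."""
    if hi - lo <= GRAIN_SIZE:
        return sum(values[lo:hi])  # serial base case

    mid = (lo + hi) // 2
    left_result = []

    # "Fork": the left half runs in a child thread.
    child = threading.Thread(
        target=lambda: left_result.append(fork_join_sum(values, lo, mid))
    )
    child.start()

    # The parent continues with the right half in the meantime.
    right = fork_join_sum(values, mid, hi)

    # "Join": wait for the child before combining the two results.
    child.join()
    return left_result[0] + right
```

Calling `fork_join_sum(data, 0, len(data))` returns the same value as `sum(data)`. The point of the sketch is that the fork and join sites are explicit in the program structure, which is the kind of information the abstract says Tapir exposes to the compiler.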
