Auto-Vectorizing TensorFlow Graphs: Jacobians, Auto-Batching And Beyond
This work addresses performance bottlenecks in machine learning frameworks such as TensorFlow, and is aimed at developers and researchers. The contribution may appear incremental, since it builds on an existing high-level dataflow IR.
The authors tackle inefficient loop-based operations in TensorFlow by proposing a static loop vectorization optimization, and report large speedups over both loop-based implementations and DyNet's run-time batching.
We propose a static loop vectorization optimization on top of the high-level dataflow IR used by frameworks like TensorFlow. A new statically vectorized parallel-for abstraction is provided on top of TensorFlow, and is used for applications ranging from auto-batching and per-example gradients to Jacobian computation, optimized map functions, and input pipeline optimization. We report huge speedups compared to both loop-based implementations and the run-time batching adopted by the DyNet framework.
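To make the idea of loop vectorization concrete, the following is a minimal sketch of one listed application, per-example gradients, written by hand in NumPy. It is not the parallel-for abstraction described above, which performs this kind of rewrite automatically on a TensorFlow graph. The linear model, the squared-error loss, and the function names `per_example_grads_loop` and `per_example_grads_vectorized` are assumptions chosen only for illustration.

```python
import numpy as np


def per_example_grads_loop(w, X, y):
    """Loop-based baseline: compute one gradient per example, one at a time."""
    grads = []
    for i in range(X.shape[0]):
        residual = X[i] @ w - y[i]           # scalar error for example i
        grads.append(2.0 * residual * X[i])  # gradient of (x_i . w - y_i)^2 w.r.t. w
    return np.stack(grads)


def per_example_grads_vectorized(w, X, y):
    """Vectorized form: the loop index becomes a leading batch dimension."""
    residual = X @ w - y                     # shape (n,), all errors at once
    return 2.0 * residual[:, None] * X       # shape (n, d), one gradient per row


# Small synthetic problem (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)

loop_result = per_example_grads_loop(w, X, y)
vec_result = per_example_grads_vectorized(w, X, y)
```

Both functions return the same array of shape (8, 3). The second replaces eight small operations per step with a single batched operation, which is the kind of transformation a static vectorizer aims to derive without the user rewriting the loop body.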