PF · LG · Aug 3, 2020

A Learned Performance Model for Tensor Processing Units

arXiv:2008.01040v2 · 12 citations
AI Analysis

This paper addresses efficient code generation for deep learning accelerators, particularly TPUs, by providing a more accurate performance model. The contribution is incremental, since it builds on existing performance-modeling methods.

The paper tackles the challenge of building accurate hardware performance models for complex processors such as Tensor Processing Units (TPUs). It learns the model from a corpus of tensor computation graph programs. The learned model outperforms an analytical model on tile-size selection and operator fusion, and it helps autotuners find faster programs.

Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for a specific program. However, they are difficult to develop because contemporary processors are complex, and the recent proliferation of deep learning accelerators has increased the development burden. We demonstrate a method of learning performance models from a corpus of tensor computation graph programs for Tensor Processing Unit (TPU) instances. We show that our learned model outperforms a heavily-optimized analytical performance model on two tasks -- tile-size selection and operator fusion -- and that it helps an autotuner discover faster programs in a setting where access to TPUs is limited or expensive.
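
To make the autotuning use case concrete, here is a minimal sketch of how a cheap performance model might shortlist tile sizes so that only a few candidates need to be measured on scarce hardware. It is not the paper's model or search procedure. The names `true_runtime`, `learned_model`, and `select_tile`, the cost surface, and the noise term are all hypothetical stand-ins chosen for illustration.

```python
# Toy illustration only: every function and constant here is an assumption,
# not the paper's learned model or its autotuner.

def true_runtime(tile):
    """Stand-in for an expensive hardware measurement (hypothetical cost surface)."""
    h, w = tile
    return abs(h - 64) + abs(w - 128) + 10.0


def learned_model(tile):
    """Stand-in for a learned performance model: a cheap, imperfect estimate."""
    h, w = tile
    noise = ((h * 31 + w * 17) % 11) - 5  # deterministic pseudo-noise in [-5, 5]
    return true_runtime(tile) + noise


def select_tile(candidates, model, measure, top_k=3):
    """Rank all candidates with the cheap model, then measure only the top_k."""
    shortlist = sorted(candidates, key=model)[:top_k]
    best = min(shortlist, key=measure)
    return best, len(shortlist)


# Candidate tile sizes as (height, width) pairs.
candidates = [(h, w) for h in (8, 16, 32, 64, 128) for w in (32, 64, 128, 256)]
best_tile, measurements_used = select_tile(candidates, learned_model, true_runtime)
print(best_tile, measurements_used)
```

In this sketch the model scores all 20 candidates but only three are measured, which reflects the setting the abstract describes, where access to TPUs is limited or expensive.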

