LG CV DC ML | Oct 7, 2019

Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization

Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, Joseph E. Gonzalez

arXiv:1910.02653v3 | 22.0 | 241 citations | Has Code

Originality: Incremental advance

AI Analysis

Checkmate addresses the memory wall problem for deep learning practitioners. It offers a practical tool for scaling training, with incremental improvements over prior checkpointing methods.

The paper tackles the memory bottleneck in DNN training by formalizing tensor rematerialization as an optimization problem, introducing Checkmate to compute optimal or near-optimal schedules that reduce training costs and enable up to 5.1x larger input sizes.

We formalize the problem of trading off DNN training time and memory requirements as the tensor rematerialization optimization problem, a generalization of prior checkpointing strategies. We introduce Checkmate, a system that solves for optimal rematerialization schedules in reasonable time (under an hour) using off-the-shelf MILP solvers, or for near-optimal schedules using an approximation algorithm, then uses these schedules to accelerate millions of training iterations. Our method scales to complex, realistic architectures and is hardware-aware through the use of accelerator-specific, profile-based cost models. In addition to reducing training cost, Checkmate enables real-world networks to be trained with up to 5.1x larger input sizes. Checkmate is an open-source project, available at https://github.com/parasj/checkmate.
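To make the time-versus-memory trade-off concrete, here is a minimal toy sketch in Python. It is not Checkmate's MILP formulation and does not use its API. It assumes a linear chain of layers, a simplified memory model (stored activations plus the largest run of consecutive dropped activations that must be rematerialized together), and brute-force search instead of a solver. The names `plan_cost_and_memory` and `best_plan` are hypothetical.

```python
from itertools import combinations


def plan_cost_and_memory(costs, mems, kept):
    """Evaluate one plan on a linear chain of layers (toy model, an assumption).

    costs[i]: forward compute cost of layer i
    mems[i]:  memory taken by layer i's activation
    kept:     set of layer indices whose activations stay in memory
    """
    # Every dropped activation is recomputed once during the backward pass.
    recompute = sum(c for i, c in enumerate(costs) if i not in kept)

    # Peak memory: all kept activations, plus the largest run of consecutive
    # dropped activations that has to be rematerialized at the same time.
    kept_mem = sum(mems[i] for i in kept)
    worst_run = 0
    run = 0
    for i, m in enumerate(mems):
        if i in kept:
            run = 0
        else:
            run += m
            worst_run = max(worst_run, run)
    return recompute, kept_mem + worst_run


def best_plan(costs, mems, budget):
    """Brute-force the plan with the least recomputation within the budget.

    Returns (recompute_cost, peak_memory, kept_indices), or None when no plan
    fits. Exponential in the number of layers, so only usable on tiny chains.
    """
    n = len(costs)
    feasible = []
    for k in range(n + 1):
        for kept in combinations(range(n), k):
            recompute, peak = plan_cost_and_memory(costs, mems, set(kept))
            if peak <= budget:
                feasible.append((recompute, peak, kept))
    return min(feasible) if feasible else None


if __name__ == "__main__":
    costs = [1, 1, 1, 1]
    mems = [1, 1, 1, 1]
    for budget in (4, 3, 2):
        print(budget, best_plan(costs, mems, budget))
```

In this toy model, tightening the memory budget forces more recomputation until no plan fits at all. The abstract describes solving the general version of this trade-off on realistic architectures with MILP solvers and profile-based cost models rather than by enumeration.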
