LG | CV | DC | ML | Oct 7, 2019

Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization

arXiv:1910.02653v3 | 240 citations | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the memory wall in deep learning training. It gives practitioners a practical tool for scaling training and is an incremental improvement over prior checkpointing methods.

The paper tackles the memory bottleneck in DNN training by formalizing tensor rematerialization as an optimization problem, introducing Checkmate to compute optimal or near-optimal schedules that reduce training costs and enable up to 5.1x larger input sizes.

We formalize the problem of trading off DNN training time and memory requirements as the tensor rematerialization optimization problem, a generalization of prior checkpointing strategies. We introduce Checkmate, a system that solves for optimal rematerialization schedules in reasonable times (under an hour) using off-the-shelf MILP solvers or near-optimal schedules with an approximation algorithm, then uses these schedules to accelerate millions of training iterations. Our method scales to complex, realistic architectures and is hardware-aware through the use of accelerator-specific, profile-based cost models. In addition to reducing training cost, Checkmate enables real-world networks to be trained with up to 5.1x larger input sizes. Checkmate is an open-source project, available at https://github.com/parasj/checkmate.
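The time-versus-memory trade-off described in the abstract can be illustrated with a toy model. The sketch below is not Checkmate's MILP formulation or its API. It assumes a linear chain of layers with unit-size activations and unit compute cost, and a simple segment-by-segment backward pass in which every dropped activation is recomputed exactly once. The names `evaluate` and `best_schedule` are hypothetical, and the search is brute force rather than a solver.

```python
from itertools import combinations


def evaluate(n_layers, checkpoints):
    """Return (peak_memory, recompute_cost) for a toy linear chain.

    Assumptions (illustrative only, not the paper's cost model):
    every activation has size 1 and costs 1 to compute, checkpointed
    activations stay in memory for the whole pass, and the backward
    pass recomputes one gap of dropped activations at a time.
    """
    kept = sorted(checkpoints)
    longest_gap = 0
    prev = -1
    # Walk the checkpoints plus a sentinel at the end of the chain to
    # measure each run of dropped (non-checkpointed) activations.
    for idx in kept + [n_layers]:
        longest_gap = max(longest_gap, idx - prev - 1)
        prev = idx
    peak_memory = len(kept) + longest_gap
    recompute_cost = n_layers - len(kept)
    return peak_memory, recompute_cost


def best_schedule(n_layers, budget):
    """Brute-force the checkpoint set with the least recomputation
    whose peak memory fits the budget. Returns None if nothing fits."""
    # More checkpoints means less recomputation, so try large sets first.
    for k in range(n_layers, -1, -1):
        for kept in combinations(range(n_layers), k):
            peak, cost = evaluate(n_layers, kept)
            if peak <= budget:
                return kept, peak, cost
    return None


if __name__ == "__main__":
    print(best_schedule(9, 9))  # keep everything, no recomputation
    print(best_schedule(9, 5))  # tighter budget forces recomputation
```

In this toy model, tightening the budget from 9 to 5 units on a 9-layer chain forces 5 activations to be recomputed, and no schedule fits in fewer than 5 units. A real system replaces the unit costs with profiled, per-operator costs and the exhaustive search with an optimization solver.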

Code Implementations: 2 repos