Douglas Thain

SE
3 papers
20 citations
Novelty 55%
AI Score 39

3 Papers

17.3 SE Mar 27
Efficiently Reproducing Distributed Workflows in Notebook-based Systems

Talha Azaz, Raza Ahmad, Md Saiful Islam et al.

Notebooks provide an author-friendly environment for iterative development, modular execution, and easy sharing. Distributed workflows are increasingly being authored and executed in notebooks, yet sharing and reproducing them remains challenging. Even small code or parameter changes often force full end-to-end re-execution of the distributed workflow, limiting iterative development for such workloads. Current methods for improving notebook execution operate on single-node workflows, while optimization techniques for distributed workflows typically sacrifice reproducibility. We introduce NBRewind, a notebook kernel system for efficient, reproducible execution of distributed workflows in notebooks. NBRewind consists of two kernels: audit and repeat. The audit kernel performs incremental, cell-level checkpointing to avoid unnecessary re-runs; the repeat kernel reconstructs checkpoints and enables partial re-execution, including of notebook cells that manage the distributed workflow. Both kernel methods are based on data-flow analysis across cells. We show how checkpoints and logs, when packaged as part of a standardized notebook specification, improve sharing and reproducibility. Using real-world case studies, we show that creating incremental checkpoints adds minimal overhead and enables portable, cross-site reproducibility of notebook-based distributed workflows on HPC systems.

LG Jul 5, 2017
SHADHO: Massively Scalable Hardware-Aware Distributed Hyperparameter Optimization

Jeff Kinnison, Nathaniel Kremer-Herman, Douglas Thain et al.

Computer vision is experiencing an AI renaissance, in which machine learning models are expediting important breakthroughs in academic research and commercial applications. Effectively training these models, however, is not trivial due in part to hyperparameters: user-configured values that control a model's ability to learn from data. Existing hyperparameter optimization methods are highly parallel but make no effort to balance the search across heterogeneous hardware or to prioritize searching high-impact spaces. In this paper, we introduce a framework for massively Scalable Hardware-Aware Distributed Hyperparameter Optimization (SHADHO). Our framework calculates the relative complexity of each search space and monitors performance on the learning task over all trials. These metrics are then used as heuristics to assign hyperparameters to distributed workers based on their hardware. We first demonstrate that our framework achieves double the throughput of a standard distributed hyperparameter optimization framework by optimizing SVM for MNIST using 150 distributed workers. We then conduct model search with SHADHO over the course of one week using 74 GPUs across two compute clusters to optimize U-Net for a cell segmentation task, discovering 515 models that achieve a lower validation loss than standard U-Net.

SE Apr 15, 2016
DISTEA: Efficient Dynamic Impact Analysis for Distributed Systems

Haipeng Cai, Douglas Thain

Dynamic impact analysis is a fundamental technique for understanding the impact of specific program entities, or changes to them, on the rest of the program for concrete executions. However, existing techniques are either inapplicable or of very limited utility for distributed programs running in multiple concurrent processes. This paper presents DISTEA, a technique and tool for dynamic impact analysis of distributed systems. By partially ordering distributed method-execution events and inferring causality from the ordered events, DISTEA can predict impacts propagated both within and across process boundaries. We implemented DISTEA for Java and applied it to four distributed programs of various types and sizes, including two enterprise systems. We also evaluated the precision and practical usefulness of DISTEA, and demonstrated its application in program comprehension, through two case studies. The results show that DISTEA is highly scalable, more effective than existing alternatives, and instrumental to understanding distributed systems and their executions.
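
The idea of partially ordering method-execution events across processes and then inferring impacts from that order can be illustrated with a small sketch. This is not DISTEA's implementation (the tool targets Java, and the abstract does not say how the ordering is computed); it assumes Lamport logical clocks as the ordering mechanism, and the `Event`, `Process`, and `impact_set` names, as well as the method names in the usage lines, are hypothetical. Because Lamport timestamps only guarantee that a causally later event has a larger clock, the resulting impact set is a conservative over-approximation: it may include methods that merely ran concurrently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    process: str
    method: str
    clock: int  # Lamport timestamp (assumed ordering mechanism)


class Process:
    """Hypothetical process that logs method executions with a logical clock."""

    def __init__(self, name, log):
        self.name = name
        self.clock = 0
        self.log = log  # shared list standing in for merged per-process traces

    def execute(self, method):
        # Each local method execution advances the clock and is recorded.
        self.clock += 1
        self.log.append(Event(self.name, method, self.clock))

    def send(self):
        # Sending a message advances the clock; the timestamp travels with it.
        self.clock += 1
        return self.clock

    def receive(self, msg_clock):
        # On receipt, jump past the sender's timestamp to preserve causality.
        self.clock = max(self.clock, msg_clock) + 1


def impact_set(log, query):
    """Methods with an event at or after the first execution of `query`."""
    starts = [e.clock for e in log if e.method == query]
    if not starts:
        return set()
    first = min(starts)
    return {e.method for e in log if e.clock >= first}


# Usage: a client calls a server, which replies.
log = []
client, server = Process("client", log), Process("server", log)
server.execute("Server.init")
client.execute("Client.main")
client.execute("Client.request")
server.receive(client.send())
server.execute("Server.handle")
client.receive(server.send())
client.execute("Client.render")

print(sorted(impact_set(log, "Client.request")))
print(sorted(impact_set(log, "Server.handle")))
```

In this run, a change to `Client.request` is reported as potentially affecting `Server.handle` in the other process, while `Server.init`, which executed before any message arrived, is excluded.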