LG · AI · DC · ML · Feb 1, 2019

TF-Replicator: Distributed Machine Learning for Researchers

arXiv:1902.00465v1 · 29 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses a practical obstacle: researchers currently need distributed systems expertise to scale machine learning models. It is an incremental advance because it builds on existing TensorFlow abstractions.

The authors simplify distributed machine learning for researchers by introducing TF-Replicator, a framework that deploys the same model code across different cluster architectures. They report strong scalability in benchmarks on ResNet-50, SN-GAN, and D4PG models.

We describe TF-Replicator, a framework for distributed machine learning designed for DeepMind researchers and implemented as an abstraction over TensorFlow. TF-Replicator simplifies writing data-parallel and model-parallel research code. The same models can be effortlessly deployed to different cluster architectures (i.e., one or many machines containing CPUs, GPUs or TPU accelerators) using synchronous or asynchronous training regimes. To demonstrate the generality and scalability of TF-Replicator, we implement and benchmark three very different models: (1) a ResNet-50 for ImageNet classification, (2) an SN-GAN for class-conditional ImageNet image generation, and (3) a D4PG reinforcement learning agent for continuous control. Our results show strong scalability performance without demanding any distributed systems expertise of the user. The TF-Replicator programming model will be open-sourced as part of TensorFlow 2.0 (see https://github.com/tensorflow/community/pull/25).
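
To make the "data-parallel, synchronous" regime mentioned in the abstract concrete, here is a minimal plain-Python simulation. It is not the TF-Replicator API and uses no TensorFlow; every function name below (`grad_mse`, `shard`, `all_reduce_mean`, `sync_data_parallel_step`) is hypothetical and chosen only for illustration. The sketch assumes equally sized shards and a scalar linear model, and shows that averaging per-replica gradients reproduces the single-device update.

```python
# Illustrative simulation of synchronous data parallelism.
# Assumption: these helpers are invented for this sketch and are NOT
# part of TF-Replicator or TensorFlow.

def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to the scalar weight w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def shard(items, num_replicas):
    """Split a batch into equally sized contiguous shards, one per replica."""
    if len(items) % num_replicas != 0:
        raise ValueError("batch size must be divisible by num_replicas")
    size = len(items) // num_replicas
    return [items[i * size:(i + 1) * size] for i in range(num_replicas)]


def all_reduce_mean(values):
    """Stand-in for a cross-replica all-reduce that returns the mean."""
    return sum(values) / len(values)


def sync_data_parallel_step(w, xs, ys, num_replicas, lr):
    """One synchronous SGD step: local gradients, then a shared averaged update."""
    x_shards = shard(xs, num_replicas)
    y_shards = shard(ys, num_replicas)
    # Each "replica" computes a gradient on its own shard of the batch.
    local_grads = [grad_mse(w, xr, yr) for xr, yr in zip(x_shards, y_shards)]
    # All replicas apply the same averaged gradient, so weights stay in sync.
    return w - lr * all_reduce_mean(local_grads)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w_single = sync_data_parallel_step(0.0, xs, ys, num_replicas=1, lr=0.01)
w_multi = sync_data_parallel_step(0.0, xs, ys, num_replicas=2, lr=0.01)
print(w_single, w_multi)  # the two updates agree up to floating-point rounding
```

With equal shard sizes, the mean of the per-shard mean gradients equals the full-batch mean gradient, which is why the one-replica and two-replica updates match. An asynchronous regime would instead let each replica apply its own gradient without waiting for the others.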

Code Implementations: 1 repo