DC · LG · ML · Mar 12, 2020

Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems

arXiv:2003.05622v1 · 160 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of scaling deep learning for online advertising models with massive sparse feature sets, offering a more efficient and cost-effective solution than traditional distributed systems.

The paper tackles the problem of training massive-scale deep learning models for online advertising systems, whose terabyte-scale parameters exceed both GPU and CPU memory capacities. It introduces a distributed hierarchical GPU parameter server that uses GPU HBM, CPU main memory, and SSD as a 3-layer storage hierarchy. A 4-node system trains a model more than 2X faster than a 150-node in-memory distributed parameter server in an MPI cluster, with a 4-9 times better price-performance ratio.

Neural networks of ads systems usually take input from multiple sources, e.g., query-ad relevance, ad features and user portraits. These inputs are encoded into one-hot or multi-hot binary features, with typically only a tiny fraction of nonzero feature values per example. Deep learning models in the online advertising industry can have terabyte-scale parameters that fit in neither the GPU memory nor the CPU main memory of a computing node. For example, a sponsored online advertising system can contain more than $10^{11}$ sparse features, making the neural network a massive model with around 10 TB of parameters. In this paper, we introduce a distributed GPU hierarchical parameter server for massive scale deep learning ads systems. We propose a hierarchical workflow that utilizes GPU High-Bandwidth Memory, CPU main memory and SSD as a 3-layer hierarchical storage. All the neural network training computations are contained in GPUs. Extensive experiments on real-world data confirm the effectiveness and the scalability of the proposed system. A 4-node hierarchical GPU parameter server can train a model more than 2X faster than a 150-node in-memory distributed parameter server in an MPI cluster. In addition, the price-performance ratio of our proposed system is 4-9 times better than an MPI-cluster solution.
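To make the 3-layer storage idea concrete, the following is a minimal single-process sketch, not the paper's implementation. It assumes a simple least-recently-used policy and models the three tiers as in-memory dictionaries standing in for GPU HBM, CPU main memory and SSD. The class name `TieredParameterStore`, its methods and the capacity parameters are all hypothetical, chosen only for illustration.

```python
from collections import OrderedDict


class TieredParameterStore:
    """Toy three-tier parameter store (illustrative assumption, not the paper's design).

    Parameters touched recently live in the small top tier; older ones are
    demoted to the middle tier and finally to the unbounded bottom tier.
    """

    def __init__(self, hbm_capacity, mem_capacity):
        self.hbm = OrderedDict()  # stands in for GPU HBM (smallest, fastest)
        self.mem = OrderedDict()  # stands in for CPU main memory
        self.ssd = {}             # stands in for SSD (holds everything else)
        self.hbm_capacity = hbm_capacity
        self.mem_capacity = mem_capacity

    def _put_mem(self, key, value):
        # Insert into the middle tier; demote its oldest entry to "SSD" if full.
        self.mem[key] = value
        self.mem.move_to_end(key)
        if len(self.mem) > self.mem_capacity:
            old_key, old_value = self.mem.popitem(last=False)
            self.ssd[old_key] = old_value

    def _put_hbm(self, key, value):
        # Insert into the top tier; demote its oldest entry to "memory" if full.
        self.hbm[key] = value
        self.hbm.move_to_end(key)
        if len(self.hbm) > self.hbm_capacity:
            old_key, old_value = self.hbm.popitem(last=False)
            self._put_mem(old_key, old_value)

    def get(self, key, default=0.0):
        """Return (value, tier_found) and promote the parameter to the top tier."""
        if key in self.hbm:
            value, tier = self.hbm[key], "hbm"
        elif key in self.mem:
            value, tier = self.mem.pop(key), "mem"
        elif key in self.ssd:
            value, tier = self.ssd.pop(key), "ssd"
        else:
            value, tier = default, "new"  # unseen sparse feature
        self._put_hbm(key, value)
        return value, tier

    def update(self, key, delta):
        """Apply an additive update; the parameter is pulled into the top tier first."""
        value, _ = self.get(key)
        self.hbm[key] = value + delta


store = TieredParameterStore(hbm_capacity=2, mem_capacity=2)
for feature_id in ["a", "b", "c", "d", "e"]:
    store.get(feature_id)      # each new feature pushes older ones down a tier
print(store.get("a"))          # "a" was demoted all the way down, then promoted again
```

The sketch only captures the promotion and demotion pattern across tiers. A real system would also have to batch transfers, shard parameters across nodes and GPUs, and keep the compute on the GPU, none of which is modeled here.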

Code Implementations: 2 repos