AR AI LG · Apr 11, 2022

Heterogeneous Acceleration Pipeline for Recommendation System Training

Muhammad Adnan, Yassaman Ebrahimzadeh Maboud, Divya Mahajan, Prashant J. Nair

arXiv:2204.05436v2 · 10.824 citations · h-index: 25

Originality: Incremental advance

AI Analysis

This work addresses scaling and cost issues in training recommendation systems for large-scale applications, though it is an incremental improvement over existing hybrid CPU-GPU methods.

The paper tackles the computational and memory inefficiencies of training recommendation models by introducing Hotline, a heterogeneous acceleration pipeline that keeps non-popular embeddings in CPU main memory and popular ones in GPU HBM. It reduces average end-to-end training time by 2.2x compared to an Intel-optimized CPU-GPU DLRM baseline.

Recommendation models rely on deep learning networks and large embedding tables, resulting in computationally and memory-intensive processes. These models are typically trained using hybrid CPU-GPU or GPU-only configurations. The hybrid mode combines the GPU's neural network acceleration with the CPU's memory capacity for storing and supplying embedding tables, but may incur significant CPU-to-GPU transfer time. In contrast, the GPU-only mode utilizes High Bandwidth Memory (HBM) across multiple GPUs for storing embedding tables. However, this approach is expensive and presents scaling concerns. This paper introduces Hotline, a heterogeneous acceleration pipeline that addresses these concerns. Hotline develops a data-aware and model-aware scheduling pipeline by leveraging the insight that only a few embedding entries are frequently accessed (popular). This approach utilizes CPU main memory for non-popular embeddings and GPUs' HBM for popular embeddings. To achieve this, the Hotline accelerator fragments a mini-batch into popular and non-popular micro-batches. It gathers the necessary working parameters for non-popular micro-batches from the CPU, while GPUs execute popular micro-batches. The hardware accelerator dynamically coordinates the execution of popular embeddings on GPUs and non-popular embeddings from the CPU's main memory. Real-world datasets and models confirm Hotline's effectiveness, reducing average end-to-end training time by 2.2x compared to an Intel-optimized CPU-GPU DLRM baseline.
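
The mini-batch fragmentation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: Hotline performs this step in a hardware accelerator, and the abstract does not state the exact classification rule. The sketch assumes that a sample belongs to the popular micro-batch only if every embedding index it accesses is in the popular set, and that popularity is estimated by counting accesses in a profiling sample. The names `find_popular` and `fragment_minibatch` are hypothetical.

```python
from collections import Counter


def find_popular(profile_samples, top_k):
    """Return the top_k most frequently accessed embedding indices.

    Assumption: popularity is estimated from access counts in a profiling set.
    """
    counts = Counter(idx for sample in profile_samples for idx in sample)
    return {idx for idx, _ in counts.most_common(top_k)}


def fragment_minibatch(minibatch, popular):
    """Split a mini-batch into popular and non-popular micro-batches.

    Assumption: a sample is 'popular' only if all of its embedding
    indices are in the popular set (and so could be served from GPU HBM).
    """
    popular_mb, non_popular_mb = [], []
    for sample in minibatch:
        if all(idx in popular for idx in sample):
            popular_mb.append(sample)
        else:
            non_popular_mb.append(sample)
    return popular_mb, non_popular_mb


# Each sample is a list of embedding-table indices (toy data).
profile = [[1, 2], [1, 2], [1, 3], [2, 7], [1, 9]]
popular = find_popular(profile, top_k=2)

minibatch = [[1, 2], [2, 9], [1, 1], [3, 7]]
popular_mb, non_popular_mb = fragment_minibatch(minibatch, popular)
print(popular_mb, non_popular_mb)
```

In this toy setting, the popular micro-batch could run entirely from embeddings resident in GPU memory, while the non-popular micro-batch would first need its parameters gathered from CPU main memory.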
