LG DC IR PF

Nov 9, 2022

RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure

Mark Zhao, Dhruv Choudhary, Devashish Tyagi, Ajay Somani, Max Kaplan, Sung-Han Lin, Sarunya Pumma, Jongsoo Park, Aarti Basant, Niket Agarwal, Carole-Jean Wu, Christos Kozyrakis

Stanford

arXiv:2211.05239v411.113 citations, h-index: 77

Originality: Incremental advance

AI Analysis

RecD addresses inefficiencies in training recommendation models for large-scale industry applications, but the advance is incremental: it optimizes existing infrastructure rather than introducing a new paradigm.

The paper tackles the problem of feature duplication in industry-scale deep learning recommendation model training, which causes storage, preprocessing, and training overheads, and shows that RecD improves training throughput by up to 2.48x, preprocessing by 1.79x, and storage efficiency by 3.71x.

We present RecD (Recommendation Deduplication), a suite of end-to-end infrastructure optimizations across the Deep Learning Recommendation Model (DLRM) training pipeline. RecD addresses immense storage, preprocessing, and training overheads caused by feature duplication inherent in industry-scale DLRM training datasets. Feature duplication arises because DLRM datasets are generated from interactions. While each user session can generate multiple training samples, many features' values do not change across these samples. We demonstrate how RecD exploits this property, end-to-end, across a deployed training pipeline. RecD optimizes data generation pipelines to decrease dataset storage and preprocessing resource demands and to maximize duplication within a training batch. RecD introduces a new tensor format, InverseKeyedJaggedTensors (IKJTs), to deduplicate feature values in each batch. We show how DLRM model architectures can leverage IKJTs to drastically increase training throughput. RecD improves the training and preprocessing throughput and storage efficiency by up to 2.48x, 1.79x, and 3.71x, respectively, in an industry-scale DLRM training system.
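The per-batch deduplication idea described in the abstract can be illustrated with a minimal sketch. This is not the IKJT format itself or any API from the paper; the function names `dedup_batch` and `expand` are hypothetical, and the sketch assumes a single sparse feature whose per-sample value lists are compared for exact equality. It stores each distinct value list once, keeps an inverse index per sample, runs a stand-in computation once per unique list, and then expands the results back to the full batch.

```python
from typing import Dict, List, Sequence, Tuple


def dedup_batch(rows: Sequence[Sequence[int]]) -> Tuple[List[Tuple[int, ...]], List[int]]:
    """Deduplicate per-sample value lists of one feature within a batch.

    Returns (unique_rows, inverse) where unique_rows[inverse[i]] == tuple(rows[i]).
    Hypothetical helper for illustration only, not the paper's implementation.
    """
    unique_rows: List[Tuple[int, ...]] = []
    inverse: List[int] = []
    slot_of: Dict[Tuple[int, ...], int] = {}
    for row in rows:
        key = tuple(row)
        slot = slot_of.get(key)
        if slot is None:
            # First time this value list appears: store it once.
            slot = len(unique_rows)
            slot_of[key] = slot
            unique_rows.append(key)
        inverse.append(slot)
    return unique_rows, inverse


def expand(unique_results: Sequence[int], inverse: Sequence[int]) -> List[int]:
    """Map results computed on unique rows back to every sample in the batch."""
    return [unique_results[i] for i in inverse]


# Four samples; the first three are assumed to come from the same user
# session, so this feature's values are identical across them.
batch = [[11, 42, 7], [11, 42, 7], [11, 42, 7], [3, 9]]

unique_rows, inverse = dedup_batch(batch)

# Stand-in for an expensive per-sample operation (for example, pooling):
# it runs once per unique row instead of once per sample.
pooled_unique = [sum(row) for row in unique_rows]
pooled = expand(pooled_unique, inverse)

stored_before = sum(len(row) for row in batch)
stored_after = sum(len(row) for row in unique_rows)
```

In this toy batch the number of stored feature values drops from 11 to 5, and the stand-in computation runs twice rather than four times. The savings depend entirely on how many samples in a batch share values, which is why the abstract also describes arranging the data so that duplication within a training batch is maximized.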
