LG · DC · IR · Jan 9, 2024

G-Meta: Distributed Meta Learning in GPU Clusters for Large-Scale Recommender Systems

arXiv:2401.04338v1 · 14 citations · h-index: 11 · CIKM
Originality: Incremental advance
AI Analysis

This work addresses a critical efficiency problem for large-scale recommender systems in industry, enabling faster model updates and better performance in applications like advertising, though it is incremental as it optimizes an existing paradigm rather than introducing a new one.

The paper tackles the inefficiency of distributed training for meta learning-based deep learning recommendation models in GPU clusters by introducing G-Meta, a high-performance framework that achieves notable training speed without statistical performance loss, as demonstrated by deployment in Alipay's systems with a 6.48% improvement in Conversion Rate and 1.06% increase in Cost Per Mille.

Recently, a new paradigm, meta learning, has been widely applied to Deep Learning Recommendation Models (DLRM) and significantly improves statistical performance, especially in cold-start scenarios. However, existing systems are not tailored for meta learning-based DLRM models and have critical efficiency problems in distributed training on GPU clusters. This is because the conventional deep learning pipeline is not optimized for the two task-specific datasets and two update loops in meta learning. This paper provides a high-performance framework for large-scale training of optimization-based Meta DLRM models over the GPU cluster, namely G-Meta. Firstly, G-Meta utilizes both data parallelism and model parallelism, with careful orchestration of computation and communication efficiency, to enable high-speed distributed training. Secondly, it proposes a Meta-IO pipeline for efficient data ingestion to alleviate the I/O bottleneck. Various experimental results show that G-Meta achieves notable training speed without loss of statistical performance. Since early 2022, G-Meta has been deployed in Alipay's core advertising and recommender system, shrinking the continuous delivery time of models by a factor of four. It also obtains a 6.48% improvement in Conversion Rate (CVR) and a 1.06% increase in CPM (Cost Per Mille) in Alipay's homepage display advertising, with the benefit of larger training samples and tasks.
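The "two task-specific datasets and two update loops" mentioned in the abstract can be illustrated with a small sketch. The code below is not G-Meta's implementation; it is a toy first-order MAML-style loop on 1-D linear regression, and every name in it (`mse_grad`, `inner_update`, `meta_step`, `make_task`, the learning rates) is a hypothetical choice made for illustration. Each task has a support set, used by the inner loop to adapt the shared weight, and a query set, used by the outer loop to update that shared weight.

```python
# Illustrative sketch only: first-order MAML on toy tasks y = slope * x.
# All function names and hyperparameters are assumptions, not from the paper.

def mse_grad(w, data):
    """Gradient w.r.t. w of the mean squared error for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)

def inner_update(w, support, inner_lr):
    """Inner loop: one gradient step on a single task's support set."""
    return w - inner_lr * mse_grad(w, support)

def meta_step(w, tasks, inner_lr, outer_lr):
    """Outer loop: update the shared weight from query-set gradients
    evaluated at each task's adapted weight (first-order approximation,
    i.e. the gradient through the inner step is ignored)."""
    meta_grad = 0.0
    for support, query in tasks:
        w_task = inner_update(w, support, inner_lr)
        meta_grad += mse_grad(w_task, query)
    return w - outer_lr * meta_grad / len(tasks)

def make_task(slope):
    """Build one task as a (support set, query set) pair of (x, y) samples."""
    support = [(x, slope * x) for x in (1.0, 2.0)]
    query = [(x, slope * x) for x in (3.0, 4.0)]
    return support, query

tasks = [make_task(s) for s in (1.0, 2.0, 3.0)]
w = 0.0
for _ in range(200):
    w = meta_step(w, tasks, inner_lr=0.05, outer_lr=0.01)
```

Even in this toy form, every training step reads two datasets per task and computes two dependent gradient passes, which is the access pattern the abstract says a conventional single-dataset, single-loop pipeline is not optimized for.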

