LG DC Dec 14, 2021

HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework

Xupeng Miao, Hailin Zhang, Yining Shi, Xiaonan Nie, Zhi Yang, Yangyu Tao, Bin Cui

arXiv:2112.07221v116.870 citations Has Code

Originality: Highly original

AI Analysis

This paper addresses slow distributed training of large embedding models and reports a significant speedup over existing frameworks.

The paper tackles the scalability of distributed training for huge embedding models by proposing HET, a cache-enabled framework that relieves the communication bottleneck. It achieves up to an 88% reduction in embedding communication and up to a 20.68x speedup over state-of-the-art baselines.

Embedding models have been an effective learning paradigm for high-dimensional data. However, one open issue of embedding models is that their representations (latent factors) often result in a large parameter space. We observe that existing distributed training frameworks face a scalability issue with embedding models, since updating and retrieving the shared embedding parameters from servers usually dominates the training cycle. In this paper, we propose HET, a new system framework that significantly improves the scalability of huge embedding model training. We embrace the skewed popularity distribution of embeddings as a performance opportunity and leverage it to address the communication bottleneck with an embedding cache. To ensure consistency across the caches, we incorporate a new consistency model into the HET design, which provides fine-grained consistency guarantees on a per-embedding basis. Compared to previous work that only allows staleness for read operations, HET also utilizes staleness for write operations. Evaluations on six representative tasks show that HET achieves up to an 88% reduction in embedding communication and up to a 20.68x speedup over the state-of-the-art baselines.
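The caching idea in the abstract can be illustrated with a minimal sketch. This is not HET's implementation or API: the class names `ParameterServer` and `StaleEmbeddingCache`, the `staleness_bound` parameter, the scalar stand-in for an embedding vector, and the Zipf-like access pattern are all assumptions made for illustration. The sketch uses a single worker that serves reads from a local cache and buffers writes per embedding, syncing a key with the server only once its buffered updates exceed the bound.

```python
import random


class ParameterServer:
    """Toy server holding one scalar per embedding key (stand-in for a vector)."""

    def __init__(self):
        self.values = {}
        self.requests = 0  # number of pull/push round trips

    def pull(self, key):
        self.requests += 1
        return self.values.get(key, 0.0)

    def push(self, key, delta):
        self.requests += 1
        self.values[key] = self.values.get(key, 0.0) + delta


class StaleEmbeddingCache:
    """Hypothetical worker-side cache with a per-embedding write-staleness bound."""

    def __init__(self, server, staleness_bound):
        self.server = server
        self.staleness_bound = staleness_bound
        self.entries = {}  # key -> [local_value, pending_delta, local_updates]

    def read(self, key):
        # A miss costs one pull; later reads are served locally.
        if key not in self.entries:
            self.entries[key] = [self.server.pull(key), 0.0, 0]
        return self.entries[key][0]

    def update(self, key, delta):
        self.read(key)  # make sure the key is cached
        entry = self.entries[key]
        entry[0] += delta  # apply locally
        entry[1] += delta  # buffer for the server
        entry[2] += 1
        if entry[2] > self.staleness_bound:
            self._sync(key)

    def _sync(self, key):
        # Push the buffered delta, then refresh the local copy.
        self.server.push(key, self.entries[key][1])
        self.entries[key] = [self.server.pull(key), 0.0, 0]

    def flush(self):
        # Push whatever is still buffered at the end of training.
        for key, entry in self.entries.items():
            if entry[2] > 0:
                self.server.push(key, entry[1])
                entry[1], entry[2] = 0.0, 0


def simulate(staleness_bound, steps=10_000, num_keys=1_000, seed=0):
    """Run a skewed access trace and return (server round trips, final values)."""
    rng = random.Random(seed)
    weights = [1.0 / rank for rank in range(1, num_keys + 1)]  # Zipf-like skew
    keys = rng.choices(range(num_keys), weights=weights, k=steps)
    server = ParameterServer()
    cache = StaleEmbeddingCache(server, staleness_bound)
    for key in keys:
        cache.read(key)
        cache.update(key, -0.01)  # constant stand-in for a gradient step
    cache.flush()
    return server.requests, server.values
```

Comparing `simulate(0)` (every update synced immediately) with, say, `simulate(8)` shows the effect on round trips: popular keys absorb many local updates between syncs, while the final server values agree up to floating-point rounding because the updates are additive. With several workers, the bound would also limit how far their cached copies can drift apart, which this single-worker sketch does not model.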
