Disaggregating Embedding Recommendation Systems with FlexEMR
This work addresses inefficient resource use when serving large-scale recommendation models, a concern for any industry that depends on them. The contribution appears incremental, since it builds on existing disaggregation concepts.
The paper tackles the high memory requirements of embedding-based recommendation models by proposing FlexEMR, a system that disaggregates embedding operations from neural network inference to improve resource utilization and reduce costs. It reports initial results from an early prototype.
Efficiently serving embedding-based recommendation (EMR) models remains a significant challenge due to their increasingly large memory requirements. Today's practice splits the model across many monolithic servers, where a mix of GPUs, CPUs, and DRAM is provisioned in fixed proportions. This approach leads to suboptimal resource utilization and increased costs. Disaggregating embedding operations from neural network inference is a promising solution but raises novel networking challenges. In this paper, we discuss the design of FlexEMR for optimized EMR disaggregation. FlexEMR proposes two sets of techniques to tackle the networking challenges: leveraging the temporal and spatial locality of embedding lookups to reduce data movement over the network, and designing an optimized multi-threaded RDMA engine for concurrent lookup subrequests. We outline the design space for each technique and present initial results from our early prototype.
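To make the locality idea concrete, the following is a minimal sketch of how temporal locality in embedding lookups can cut network traffic: a small LRU cache sits on the inference side, and only rows that miss the cache are requested from the remote embedding server. This is an illustration under stated assumptions, not FlexEMR's actual design. The class `CachedEmbeddingClient`, the function `fake_remote_fetch`, and the in-memory `TABLE` are hypothetical stand-ins; a real system would replace the fetch function with a network (for example, RDMA) read.

```python
from collections import OrderedDict
from typing import Callable, Dict, List, Sequence

Vector = List[float]


class CachedEmbeddingClient:
    """Hypothetical inference-side LRU cache in front of a remote embedding store."""

    def __init__(
        self,
        fetch_remote: Callable[[Sequence[int]], Dict[int, Vector]],
        capacity: int,
    ) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._fetch_remote = fetch_remote
        self._capacity = capacity
        self._cache: "OrderedDict[int, Vector]" = OrderedDict()
        self.rows_fetched = 0  # rows that actually crossed the "network"

    def lookup(self, ids: Sequence[int]) -> List[Vector]:
        # Collect cache misses, deduplicated so that each missing row
        # is requested at most once per lookup call.
        missing: List[int] = []
        seen = set()
        for i in ids:
            if i not in self._cache and i not in seen:
                seen.add(i)
                missing.append(i)

        fetched = self._fetch_remote(missing) if missing else {}
        self.rows_fetched += len(missing)

        # Assemble the result in request order, refreshing recency on hits.
        result: List[Vector] = []
        for i in ids:
            if i in self._cache:
                self._cache.move_to_end(i)
                result.append(self._cache[i])
            else:
                result.append(fetched[i])

        # Admit the fetched rows and evict least recently used entries.
        for i, vec in fetched.items():
            self._cache[i] = vec
            while len(self._cache) > self._capacity:
                self._cache.popitem(last=False)
        return result


# Stand-in for the remote embedding server (assumption: a plain dict).
TABLE: Dict[int, Vector] = {i: [float(i), float(i) * 0.5] for i in range(1000)}


def fake_remote_fetch(ids: Sequence[int]) -> Dict[int, Vector]:
    return {i: TABLE[i] for i in ids}


if __name__ == "__main__":
    client = CachedEmbeddingClient(fake_remote_fetch, capacity=4)
    client.lookup([1, 2, 3])    # cold cache: three rows fetched
    client.lookup([2, 3, 4])    # rows 2 and 3 hit the cache; only row 4 is fetched
    print(client.rows_fetched)  # prints 4
```

In this sketch, six rows are requested across the two lookups but only four are transferred. How much real traffic such a cache saves depends on how skewed the access pattern is, and the eviction policy (LRU here) is one choice among several.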