The Architectural Implications of Facebook's DNN-based Personalized Recommendation
This work addresses the problem of optimizing at-scale recommendation systems for data centers. Its contribution is incremental, consisting of foundational analysis and benchmarks.
The paper tackles the lack of systems research on deep neural network-based personalized recommendation by presenting real-world, production-scale DNNs and performance metrics. Its analysis shows that inference latency varies by 60% across three Intel server generations and that batching and co-location of inferences can improve latency-bounded throughput.
The widespread application of deep learning has changed the landscape of computation in the data center. In particular, personalized recommendation for content ranking is now largely accomplished leveraging deep neural networks. However, despite the importance of these models and the amount of compute cycles they consume, relatively little research attention has been devoted to systems for recommendation. To facilitate research and to advance the understanding of these workloads, this paper presents a set of real-world, production-scale DNNs for personalized recommendation coupled with relevant performance metrics for evaluation. In addition to releasing a set of open-source workloads, we conduct in-depth analysis that underpins future system design and optimization for at-scale recommendation: Inference latency varies by 60% across three Intel server generations, batching and co-location of inferences can drastically improve latency-bounded throughput, and the diverse composition of recommendation models leads to different optimization strategies.
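To make the notion of latency-bounded throughput concrete, the following is a minimal sketch, not the paper's methodology. It assumes throughput is measured as items served per second at the batch size that still meets a latency target. The function names, the linear cost model, and all numbers (2 ms fixed overhead, 0.5 ms per item, 10 ms target) are hypothetical and chosen only for illustration.

```python
from typing import Callable, Optional, Tuple


def latency_bounded_throughput(
    latency_ms: Callable[[int], float],
    sla_ms: float,
    max_batch: int,
) -> Optional[Tuple[int, float]]:
    """Return (batch_size, items_per_second) with the highest throughput
    among batch sizes 1..max_batch whose latency meets the target.

    Returns None if no batch size meets the latency target.
    """
    best: Optional[Tuple[int, float]] = None
    for batch in range(1, max_batch + 1):
        latency = latency_ms(batch)
        if latency > sla_ms:
            continue  # this batch size violates the latency target
        throughput = batch * 1000.0 / latency  # items per second
        if best is None or throughput > best[1]:
            best = (batch, throughput)
    return best


def toy_latency_ms(batch: int) -> float:
    # Hypothetical cost model: fixed per-request overhead plus per-item cost.
    return 2.0 + 0.5 * batch


# Compare serving one item at a time against batching under a 10 ms target.
unbatched = latency_bounded_throughput(toy_latency_ms, sla_ms=10.0, max_batch=1)
batched = latency_bounded_throughput(toy_latency_ms, sla_ms=10.0, max_batch=64)
print(unbatched, batched)
```

Under this toy model, batching amortizes the fixed overhead: the largest batch that fits the 10 ms target is 16 items, giving 1600 items per second versus 400 unbatched. Real recommendation models need not follow a linear cost curve, so the gain would have to be measured per model and per server.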