RecPipe: Co-designing Models and Hardware to Jointly Optimize Recommendation Quality and Performance
This work addresses the problem of efficient and high-quality recommendation inference for large-scale systems, representing a domain-specific advancement with incremental hardware co-design.
The paper tackles the challenge of optimizing deep learning recommendation systems for high quality and performance under strict latency and load constraints by introducing RecPipe and its accelerator RPAccel, which achieve 3x latency and 6x throughput improvements at iso-quality compared to prior art.
Deep learning recommendation systems must provide high quality, personalized content under strict tail-latency targets and high system loads. This paper presents RecPipe, a system to jointly optimize recommendation quality and inference performance. Central to RecPipe is decomposing recommendation models into multi-stage pipelines to maintain quality while reducing compute complexity and exposing distinct parallelism opportunities. RecPipe implements an inference scheduler to map multi-stage recommendation engines onto commodity, heterogeneous platforms (e.g., CPUs, GPUs).While the hardware-aware scheduling improves ranking efficiency, the commodity platforms suffer from many limitations requiring specialized hardware. Thus, we design RecPipeAccel (RPAccel), a custom accelerator that jointly optimizes quality, tail-latency, and system throughput. RPAc-cel is designed specifically to exploit the distinct design space opened via RecPipe. In particular, RPAccel processes queries in sub-batches to pipeline recommendation stages, implements dual static and dynamic embedding caches, a set of top-k filtering units, and a reconfigurable systolic array. Com-pared to prior-art and at iso-quality, we demonstrate that RPAccel improves latency and throughput by 3x and 6x.