ProactivePIM: Accelerating Weight-Sharing Embedding Layer with PIM for Scalable Recommendation System
This work addresses inference scalability for personalized recommendation systems; it is an incremental improvement over existing PIM methods.
The paper tackles the challenge of accelerating weight-sharing embedding layers in recommendation systems by proposing ProactivePIM, a processing-in-memory system that integrates a cache with prefetching and subtable mapping to eliminate communication overhead, achieving a 4.8x speedup over prior works.
The growing model size of personalized recommendation systems poses new challenges for inference. Weight-sharing algorithms have been proposed to reduce model size, but they increase memory access. Recent advances in processing-in-memory (PIM) have improved model throughput by exploiting memory parallelism, but weight-sharing algorithms introduce massive CPU-PIM communication in prior PIM systems. We propose ProactivePIM, a PIM system for accelerating weight-sharing recommendation systems. ProactivePIM integrates a cache within the PIM, paired with a prefetching scheme that leverages a unique locality of the algorithm, and eliminates communication overhead through a subtable mapping strategy. ProactivePIM achieves a 4.8x speedup compared to prior works.
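To make the size-versus-access trade-off concrete, the following is a minimal sketch of one well-known weight-sharing scheme, the quotient-remainder (QR) trick. The abstract does not say which weight-sharing algorithm is targeted, so the choice of the QR trick is an assumption made for illustration, and the names `QREmbedding`, `make_table`, and `num_buckets` are hypothetical. The sketch models only the embedding lookup, not ProactivePIM's cache, prefetching, or subtable mapping.

```python
import random


def make_table(rows, dim, seed):
    """Build a rows x dim table of random floats (stand-in for trained weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(rows)]


class QREmbedding:
    """Illustrative quotient-remainder embedding (assumed scheme, not from the source).

    Two small subtables replace one table with num_items rows.
    """

    def __init__(self, num_items, num_buckets, dim, seed=0):
        self.num_buckets = num_buckets
        q_rows = (num_items + num_buckets - 1) // num_buckets  # ceiling division
        self.q_table = make_table(q_rows, dim, seed)
        self.r_table = make_table(num_buckets, dim, seed + 1)
        self.accesses = 0  # counts subtable row reads

    def lookup(self, idx):
        q, r = divmod(idx, self.num_buckets)
        self.accesses += 2  # one row from each subtable per lookup
        # Combine the two shared rows element-wise into the final embedding.
        return [a * b for a, b in zip(self.q_table[q], self.r_table[r])]

    def num_rows(self):
        return len(self.q_table) + len(self.r_table)


emb = QREmbedding(num_items=1_000_000, num_buckets=1000, dim=4)
vec = emb.lookup(123456)
print(emb.num_rows(), emb.accesses)
```

In this sketch, one million items are served by 2,000 stored rows instead of 1,000,000, but every lookup reads two rows rather than one. This is the kind of extra memory traffic the abstract refers to when it says weight sharing increases memory access.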