Reducing the GPU Memory Bottleneck with Lossless Compression for ML -- Extended
This work offers ML practitioners facing GPU memory limitations a way to speed up training and inference without accuracy loss, with measured gains for GNN, DLRM, and LLM workloads.
This paper addresses the GPU memory bottleneck in machine learning by introducing Invariant Bit Packing (IBP), a novel lossless compression algorithm. IBP reduces data transfer time, yielding on average 74% faster GNN training, 180% faster DLRM embedding lookup, and 24% faster LLM inference.
Machine learning (ML) training and inference often process data sets far exceeding GPU memory capacity, forcing them to rely on PCIe for on-demand tensor transfers, causing critical transfer bottlenecks. Lossy compression has been proposed to relieve bottlenecks but introduces workload-dependent accuracy loss, making it complex or even prohibitive to use in existing ML deployments. We explore lossless compression as an alternative that avoids this deployment complexity. We identify where lossless compression can be integrated into ML pipelines while minimizing interference with GPU execution. Based on our findings, we introduce Invariant Bit Packing (IBP), a novel lossless compression algorithm designed to minimize data transfer time for ML. IBP identifies and eliminates invariant bits across groups of tensors, improving throughput through GPU-optimized decompression that leverages warp parallelism, low-overhead bit operations, and asynchronous PCIe transfers. We provide easy-to-use APIs, showcasing them by adding IBP support to GNN training, as well as DLRM and LLM inference frameworks. IBP achieves, on average, 74% faster GNN training, 180% faster DLRM embedding lookup, and 24% faster LLM inference.
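The core idea of eliminating invariant bits across a group of values can be illustrated with a small CPU-side sketch in Python. This is not the paper's implementation: the function names (`invariant_mask`, `compress`, `decompress`, `float_to_bits`), the 32-bit word size, and the storage layout (one shared mask, one shared base word, and one packed integer per value) are assumptions made for illustration. The sketch does not attempt the GPU-optimized decompression, warp parallelism, or asynchronous PCIe transfers described above.

```python
import struct
from functools import reduce

WORD_BITS = 32                      # assumed word size for this sketch
FULL = (1 << WORD_BITS) - 1


def float_to_bits(x):
    """Reinterpret a float32 value as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def invariant_mask(words):
    """Return a mask whose set bits hold the same value in every word."""
    all_and = reduce(lambda a, b: a & b, words, FULL)
    all_or = reduce(lambda a, b: a | b, words, 0)
    # Invariant means 1 in every word (all_and) or 0 in every word (~all_or).
    return (all_and | ~all_or) & FULL


def varying_positions(mask):
    """Bit positions that differ somewhere in the group, lowest first."""
    return [i for i in range(WORD_BITS) if not (mask >> i) & 1]


def compress(words):
    """Store invariant bits once; keep only the varying bits per word."""
    if not words:
        raise ValueError("need at least one word")
    mask = invariant_mask(words)
    base = words[0] & mask          # shared values of the invariant bits
    positions = varying_positions(mask)
    packed = []
    for w in words:
        v = 0
        for k, pos in enumerate(positions):
            v |= ((w >> pos) & 1) << k
        packed.append(v)
    return mask, base, packed


def decompress(mask, base, packed):
    """Scatter each packed value back into its varying bit positions."""
    positions = varying_positions(mask)
    out = []
    for v in packed:
        w = base
        for k, pos in enumerate(positions):
            w |= ((v >> k) & 1) << pos
        out.append(w)
    return out


if __name__ == "__main__":
    # Values of similar magnitude share sign, exponent, and low mantissa bits.
    words = [float_to_bits(x) for x in (0.5, 0.625, 0.75, 0.875)]
    mask, base, packed = compress(words)
    print(f"mask={mask:#010x} base={base:#010x} packed={packed}")
    assert decompress(mask, base, packed) == words  # lossless round trip
```

In this toy group only two of the 32 bits vary, so each value shrinks to a 2-bit payload plus the shared mask and base word. Real tensors will have fewer invariant bits, and how values are grouped determines how many bits can be dropped.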