AR, AI, CL, DC | Apr 18, 2025

HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing

arXiv:2504.16112v1 | 4 citations | h-index: 5
Originality: Incremental advance
AI Analysis

The paper addresses scalability and cost problems that AI practitioners face in large-scale LLM inference, though its contribution is an incremental hardware optimization.

The paper tackles inefficiencies in GPU-based LLM inference caused by memory-bound attention layers by proposing a High-bandwidth Processing Unit (HPU) as a co-processor to offload these operations, resulting in up to 4.1x performance gains and 4.6x energy efficiency improvements over GPU-only systems.

The attention layer, a core component of Transformer-based LLMs, brings out inefficiencies in current GPU systems due to its low operational intensity and the substantial memory requirements of KV caches. We propose a High-bandwidth Processing Unit (HPU), a memory-intensive co-processor that enhances GPU resource utilization during large-batched LLM inference. By offloading memory-bound operations, the HPU allows the GPU to focus on compute-intensive tasks, increasing overall efficiency. Also, the HPU, as an add-on card, scales out to accommodate surging memory demands driven by large batch sizes and extended sequence lengths. In this paper, we show the HPU prototype implemented with PCIe-based FPGA cards mounted on a GPU system. Our novel GPU-HPU heterogeneous system demonstrates up to 4.1x performance gains and 4.6x energy efficiency improvements over a GPU-only system, providing scalability without increasing the number of GPUs.
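
To make "low operational intensity" and "substantial memory requirements of KV caches" concrete, here is a rough back-of-the-envelope sketch. It is not taken from the paper: the function names and the model dimensions (32 layers, 32 KV heads, head dimension 128, fp16) are hypothetical, and the cost model is deliberately simplified. It assumes standard multi-head attention in the decode phase, where each sequence reads its own cached keys and values once per generated token, and it ignores softmax, query, and output traffic.

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache: one K and one V vector per token, KV head, and layer."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


def decode_attention_intensity(seq_len, head_dim, bytes_per_elem=2):
    """FLOPs per byte for one decode step of one head (simplified model).

    QK^T and the weighted sum over V each cost 2 * seq_len * head_dim FLOPs.
    The dominant memory traffic is reading the cached K and V once.
    """
    flops = 4 * seq_len * head_dim
    bytes_moved = 2 * seq_len * head_dim * bytes_per_elem
    return flops / bytes_moved


def linear_layer_intensity(batch, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte for a dense layer whose weights are shared across the batch."""
    flops = 2 * batch * d_in * d_out
    bytes_moved = (d_in * d_out + batch * d_in + batch * d_out) * bytes_per_elem
    return flops / bytes_moved


if __name__ == "__main__":
    # Hypothetical configuration, chosen only for illustration.
    cache = kv_cache_bytes(batch=64, seq_len=4096, n_layers=32, n_kv_heads=32, head_dim=128)
    print(f"KV cache: {cache / 2**30:.0f} GiB")
    print(f"attention intensity: {decode_attention_intensity(4096, 128):.2f} FLOP/byte")
    for b in (1, 64):
        print(f"linear intensity at batch {b}: {linear_layer_intensity(b, 4096, 4096):.2f} FLOP/byte")
```

Under these assumptions, attention stays at about 1 FLOP per byte no matter how large the batch or sequence is, because every sequence brings its own keys and values. The dense layer's intensity grows with batch size because its weights are reused across the batch. The KV cache meanwhile grows linearly with both batch size and sequence length, which is the memory pressure the abstract describes.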
