LG AR DC · Mar 21, 2024

AI and Memory Wall

Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W. Mahoney, Kurt Keutzer

arXiv:2403.14123v138.3374 citations · h-index: 42 · IEEE Micro

Originality: Synthesis-oriented

AI Analysis

This paper addresses a critical hardware-software co-design problem for AI practitioners and researchers, highlighting an incremental but urgent shift in the performance bottleneck from compute to memory bandwidth.

The paper argues that memory bandwidth is becoming the primary performance bottleneck in AI, especially when serving large language models, because it has scaled more slowly than compute. It calls for redesigning model architectures, training, and deployment strategies to address this limitation.

The availability of unprecedented unsupervised training data, along with neural scaling laws, has resulted in an unprecedented surge in model size and compute requirements for serving/training LLMs. However, the main performance bottleneck is increasingly shifting to memory bandwidth. Over the past 20 years, peak server hardware FLOPS has been scaling at 3.0x/2yrs, outpacing the growth of DRAM and interconnect bandwidth, which have only scaled at 1.6 and 1.4 times every 2 years, respectively. This disparity has made memory, rather than compute, the primary bottleneck in AI applications, particularly in serving. Here, we analyze encoder and decoder Transformer models and show how memory bandwidth can become the dominant bottleneck for decoder models. We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
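To make the scaling disparity in the abstract concrete, the following is a minimal sketch that compounds the quoted per-two-year growth rates (3.0x for peak FLOPS, 1.6x for DRAM bandwidth, 1.4x for interconnect bandwidth) over 20 years. The function and variable names are illustrative and not from the paper, and the sketch assumes the rates hold constant over the whole period, which is a simplification.

```python
def cumulative_growth(rate_per_2yrs: float, years: float) -> float:
    """Compound a per-two-year growth factor over a span of years."""
    return rate_per_2yrs ** (years / 2.0)


YEARS = 20  # time span quoted in the abstract

# Per-two-year growth factors quoted in the abstract.
RATES = {
    "peak FLOPS": 3.0,
    "DRAM bandwidth": 1.6,
    "interconnect bandwidth": 1.4,
}

# Cumulative growth of each quantity, assuming a constant rate (simplification).
growth = {name: cumulative_growth(rate, YEARS) for name, rate in RATES.items()}

# How much faster compute grew than each kind of bandwidth.
gap_dram = growth["peak FLOPS"] / growth["DRAM bandwidth"]
gap_interconnect = growth["peak FLOPS"] / growth["interconnect bandwidth"]

for name, factor in growth.items():
    print(f"{name}: ~{factor:,.0f}x over {YEARS} years")
print(f"compute vs DRAM bandwidth gap: ~{gap_dram:,.0f}x")
print(f"compute vs interconnect bandwidth gap: ~{gap_interconnect:,.0f}x")
```

Under these assumptions, compute grows by roughly 59,000x while DRAM bandwidth grows by about 110x and interconnect bandwidth by about 29x, so the ratio of available compute to available bandwidth widens by a factor in the hundreds to thousands. This is the gap the abstract calls the memory wall.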
