AR | AI | May 13, 2024

SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts

arXiv:2405.07518v2 | 40 citations | h-index: 69 | Micro
Originality: Incremental advance
AI Analysis

This paper addresses the memory wall problem for AI deployment, offering a scalable solution for enterprise inference and training. The advance is incremental because it builds on existing CoE and dataflow concepts.

The paper tackles the high cost and memory challenges of deploying large language models by proposing Samba-CoE, a Composition of Experts system with 150 experts and a trillion total parameters, deployed on the SambaNova SN40L accelerator. It reports speedups of up to 13x over an unfused baseline and a machine footprint reduction of up to 19x for CoE inference deployments.

Monolithic large language models (LLMs) like GPT-4 have paved the way for modern generative AI applications. Training, serving, and maintaining monolithic LLMs at scale, however, remains prohibitively expensive and challenging. The disproportionate increase in the compute-to-memory ratio of modern AI accelerators has created a memory wall, necessitating new methods to deploy AI. Composition of Experts (CoE) is an alternative modular approach that lowers the cost and complexity of training and serving. However, this approach presents two key challenges when using conventional hardware: (1) without fused operations, smaller models have lower operational intensity, which makes high utilization more challenging to achieve; and (2) hosting a large number of models can be either prohibitively expensive or slow when dynamically switching between them. In this paper, we describe how combining CoE, streaming dataflow, and a three-tier memory system scales the AI memory wall. We describe Samba-CoE, a CoE system with 150 experts and a trillion total parameters. We deploy Samba-CoE on the SambaNova SN40L Reconfigurable Dataflow Unit (RDU), a commercial dataflow accelerator architecture that has been co-designed for enterprise inference and training applications. The chip introduces a new three-tier memory system with on-chip distributed SRAM, on-package HBM, and off-package DDR DRAM. A dedicated inter-RDU network enables scaling up and out over multiple sockets. We demonstrate speedups ranging from 2$\times$ to 13$\times$ on various benchmarks running on eight RDU sockets compared with an unfused baseline. We show that for CoE inference deployments, the 8-socket RDU Node reduces machine footprint by up to 19$\times$, speeds up model switching time by 15$\times$ to 31$\times$, and achieves an overall speedup of 3.7$\times$ over a DGX H100 and 6.6$\times$ over a DGX A100.
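To make the model-switching challenge concrete, here is a minimal Python sketch of a toy cost model, not the paper's implementation. It assumes experts are stored in a large slow tier (standing in for DDR) and must be copied into a smaller fast tier (standing in for HBM) before they run, with least-recently-used eviction. The class name `ExpertCache`, the LRU policy, and every capacity and bandwidth figure are hypothetical placeholders chosen for illustration; none of them come from the abstract.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy model: a small fast tier caching experts that live in a large slow tier."""

    def __init__(self, fast_capacity_gb, slow_bandwidth_gb_per_s):
        self.fast_capacity_gb = fast_capacity_gb
        # Assumed copy bandwidth from the slow tier into the fast tier.
        self.slow_bandwidth_gb_per_s = slow_bandwidth_gb_per_s
        # Maps expert name -> size in GB; insertion order tracks recency.
        self.resident = OrderedDict()
        self.used_gb = 0.0

    def activate(self, name, size_gb):
        """Make an expert resident and return the estimated switch time in seconds."""
        if name in self.resident:
            # Hit: the expert is already in the fast tier, so no copy is needed.
            self.resident.move_to_end(name)
            return 0.0
        if size_gb > self.fast_capacity_gb:
            raise ValueError("expert does not fit in the fast tier")
        # Miss: evict least-recently-used experts until the new one fits.
        while self.used_gb + size_gb > self.fast_capacity_gb:
            _, evicted_gb = self.resident.popitem(last=False)
            self.used_gb -= evicted_gb
        self.resident[name] = size_gb
        self.used_gb += size_gb
        # Switch cost is modeled as pure transfer time: bytes moved / bandwidth.
        return size_gb / self.slow_bandwidth_gb_per_s


if __name__ == "__main__":
    # Placeholder numbers: a 32 GB fast tier fed at 100 GB/s.
    cache = ExpertCache(fast_capacity_gb=32, slow_bandwidth_gb_per_s=100)
    for expert in ["expert_a", "expert_b", "expert_a", "expert_c"]:
        seconds = cache.activate(expert, size_gb=14)
        print(f"{expert}: switch time {seconds:.2f} s, resident {list(cache.resident)}")
```

In this model, switching cost is governed by the bandwidth between the tiers and by how often the router selects an expert that is not resident, which is one way to read the abstract's claim that faster model switching contributes to the overall speedup.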

