AR · LG · May 13, 2025

AI Accelerators for Large Language Model Inference: Architecture Analysis and Scaling Strategies

arXiv:2506.00008v1 · 3 citations · h-index: 1
Originality: Highly original
AI Analysis

The paper provides quantitative guidance for matching workloads to accelerators and reveals architectural gaps for next-generation designs, a critical problem for hardware developers and AI practitioners deploying large models.

This paper presents the first cross-architectural performance study of commercial AI accelerators for large language model inference. It finds up to 3.7x performance variation across architectures and shows that expert parallelism offers an 8.4x parameter-to-compute advantage, at the cost of 2.1x higher latency variance than tensor parallelism.

The rapid growth of large language models (LLMs) is driving a new wave of specialized hardware for inference. This paper presents the first workload-centric, cross-architectural performance study of commercial AI accelerators, spanning GPU-based chips, hybrid packages, and wafer-scale engines. We compare memory hierarchies, compute fabrics, and on-chip interconnects, and observe up to 3.7x performance variation across architectures as batch size and sequence length change. Four scaling techniques for trillion-parameter models are examined; expert parallelism offers an 8.4x parameter-to-compute advantage but incurs 2.1x higher latency variance than tensor parallelism. These findings provide quantitative guidance for matching workloads to accelerators and reveal architectural gaps that next-generation designs must address.
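To make the phrase "parameter-to-compute advantage" concrete, here is a minimal sketch of one common way to reason about it for a mixture-of-experts model: the ratio of parameters stored to parameters actually used per token under top-k routing. The function name, the parameter split, and the example sizes are all hypothetical illustrations. The abstract does not say how its 8.4x figure was derived, so this is not the paper's method.

```python
def param_to_compute_ratio(shared_params, expert_params, num_experts, top_k):
    """Ratio of stored parameters to parameters active per token.

    Assumption (not from the paper): each token uses all shared
    parameters (e.g. attention) plus top_k of num_experts experts,
    and per-token compute scales with the active parameter count.
    """
    if not 1 <= top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    total = shared_params + num_experts * expert_params
    active = shared_params + top_k * expert_params
    return total / active


# Hypothetical sizes: 1B shared parameters, 16 experts of 1B each, top-2 routing.
ratio = param_to_compute_ratio(1e9, 1e9, num_experts=16, top_k=2)
print(f"parameter-to-compute ratio: {ratio:.2f}x")

# A dense model (one expert, always active) has a ratio of exactly 1.
dense = param_to_compute_ratio(1e9, 1e9, num_experts=1, top_k=1)
```

Under these assumptions the ratio grows with the number of experts and shrinks as more experts are activated per token, which is the general trade-off behind the advantage the abstract reports for expert parallelism.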

