DC AI | Jul 1, 2025

Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and NVIDIA Data Center GPUs

Mohammad Firas Sada, John J. Graham, Elham E Khoda, Mahidhar Tatineni, Dmitry Mishin, Rajesh K. Gupta, Rick Wagner, Larry Smarr, Thomas A. DeFanti, Frank Würthwein

arXiv:2507.00418v3 | h-index: 45 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses energy efficiency and resource allocation for LLM serving in HPC clusters, offering insights for energy-constrained deployments, but it is incremental as it compares existing hardware without introducing new methods.

This study benchmarks the Qualcomm Cloud AI 100 Ultra accelerator for LLM inference against NVIDIA A100 GPUs, finding that it achieves competitive energy efficiency with up to 35x lower power consumption for smaller models and enables more granular hardware allocation, such as running 70B models on 1 card versus 8 GPUs.

This study presents a benchmarking analysis of the Qualcomm Cloud AI 100 Ultra (QAic) accelerator for large language model (LLM) inference, evaluating its energy efficiency (throughput per watt), performance, and hardware scalability against NVIDIA A100 GPUs (in 4x and 8x configurations) within the National Research Platform (NRP) ecosystem. A total of 12 open-source LLMs, ranging from 124 million to 70 billion parameters, are served using the vLLM framework. Our analysis reveals that QAic achieves competitive energy efficiency, with advantages on specific models, while enabling more granular hardware allocation: some 70B models run on as few as 1 QAic card, versus the 8 A100 GPUs otherwise required, with 20x lower power consumption (148W vs 2,983W). For smaller models, a single QAic device achieves up to 35x lower power consumption than our 4-GPU A100 configuration (36W vs 1,246W). The findings offer insights into the potential of the Qualcomm Cloud AI 100 Ultra for energy-constrained and resource-efficient HPC deployments on the NRP.
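The power ratios quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names `power_ratio` and `tokens_per_watt` are hypothetical helpers, not part of the paper's code. The wattage figures are the ones stated in the abstract. No throughput numbers are given there, so `tokens_per_watt` is defined but not applied to any data.

```python
def power_ratio(baseline_watts: float, device_watts: float) -> float:
    """Return how many times more power the baseline draws than the device."""
    if device_watts <= 0:
        raise ValueError("device_watts must be positive")
    return baseline_watts / device_watts


def tokens_per_watt(tokens_per_second: float, watts: float) -> float:
    """Energy efficiency as throughput per watt (the metric named in the abstract)."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return tokens_per_second / watts


# Wattage figures taken from the abstract.
ratio_70b = power_ratio(baseline_watts=2983, device_watts=148)    # 8x A100 vs 1 QAic card
ratio_small = power_ratio(baseline_watts=1246, device_watts=36)   # 4x A100 vs 1 QAic device

print(f"70B model:   {ratio_70b:.1f}x lower power")
print(f"small model: {ratio_small:.1f}x lower power")
```

The two ratios come out to about 20.2 and 34.6, consistent with the rounded "20x" and "up to 35x" in the abstract. A lower power draw alone does not imply higher energy efficiency; that depends on throughput as well, which is why the study reports throughput per watt.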
