AR · Apr 13

Technology solutions targeting the performance of gen-AI inference in resource constrained platforms

arXiv:2604.11128 · 6.4 · h-index: 6
Predicted impact: top 69% in AR · last 90 days
Originality: Synthesis-oriented
AI Analysis

For engineers designing mobile or edge devices, this work provides performance insights into memory technologies for gen-AI inference, though it is an incremental analysis without novel methods or empirical results.

This paper evaluates two technology solutions—High Bandwidth Storage (HBS) for large models and bonded global buffer memory chiplets for small models—to alleviate memory pressure in generative AI inference on resource-constrained platforms, using a hierarchical roofline model to outline bandwidth/latency requirements for acceptable throughput.

The rise of generative AI workloads, particularly language model inference, is intensifying on/off-chip memory pressure. Multimodal inputs such as video streams or images and downstream applications like Question Answering (QA) and analysis over large documents incur long context lengths, requiring caching of massive Key and Value states of the previous tokens. Even a low degree of concurrent inference serving on resource-constrained devices, like mobiles, can further add to memory capacity pressure and runtime memory management complexity. In this paper, we evaluate the performance implications of two emerging technology solutions to alleviate the memory pressure in terms of both capacity and bandwidth using a hierarchical roofline-based analytical performance model. For large models (e.g., 13B parameters) and context lengths, we investigate the performance implications of High Bandwidth Storage (HBS) and outline bandwidth/latency requirements to achieve an acceptable throughput for interactivity. For small models (e.g., 1B parameters), we evaluate the merit of a bonded global buffer memory chiplet and propose how to best utilize it.
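The two quantities the abstract leans on, the size of the cached Key/Value states and a hierarchical roofline bound on per-token latency, can be sketched in a few lines. This is a minimal illustration and not the paper's model: the function names, the layer and head counts, and every bandwidth and compute figure below are assumptions chosen for the example. The roofline here simply takes the slowest of the compute time and the transfer time at each memory level, which ignores overlap details and latency terms that a fuller model would include.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to cache Key and Value states for all previous tokens.

    The leading factor 2 accounts for storing both K and V per layer.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_elem * batch)


def decode_step(flops, peak_flops, traffic_bytes, bandwidth_bytes_per_s):
    """Hierarchical roofline bound for one decode step.

    traffic_bytes and bandwidth_bytes_per_s map a memory-level name
    (e.g. "dram", "storage") to bytes moved and bytes/s at that level.
    Returns (time_in_seconds, name_of_bottleneck).
    """
    times = {"compute": flops / peak_flops}
    for level, nbytes in traffic_bytes.items():
        times[level] = nbytes / bandwidth_bytes_per_s[level]
    bottleneck = max(times, key=times.get)
    return times[bottleneck], bottleneck


if __name__ == "__main__":
    # Assumed shape for a 13B-class model (illustrative, not from the paper).
    kv = kv_cache_bytes(n_layers=40, n_kv_heads=40, head_dim=128,
                        context_len=4096)
    print(f"KV cache: {kv / 2**30:.3f} GiB")

    # Assumed scenario: fp16 weights streamed from a storage tier each
    # token, KV cache read from DRAM. All hardware numbers are hypothetical.
    params = 13e9
    t, where = decode_step(
        flops=2 * params,                       # ~2 FLOPs per parameter per token
        peak_flops=10e12,
        traffic_bytes={"dram": kv, "storage": 2 * params},
        bandwidth_bytes_per_s={"dram": 50e9, "storage": 10e9},
    )
    print(f"bound: {t:.3f} s/token ({1 / t:.2f} tokens/s), limited by {where}")
```

Sweeping the storage bandwidth in a sketch like this is one way to see where the bottleneck moves from the storage tier to DRAM or compute, which is the kind of bandwidth requirement the abstract says the paper outlines.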
