AR · AI · LG · Feb 14, 2025

MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs

arXiv:2503.11663v1 · 6 citations · h-index: 12 · MLSys
Originality: Incremental advance
AI Analysis

This work addresses the problem of efficient LLM deployment on resource-constrained edge devices, offering a domain-specific optimization that is incremental over existing quantization and sparse acceleration techniques.

The paper tackles the memory and latency challenges of running large language models (LLMs) on low-power edge devices. It introduces MEADOW, a framework that reduces off-chip memory access through a token-parallel head-sequential (TPHS) dataflow and weight packing. The reported results are 1.5x lower decode latency and 2.5x lower prefill latency than a GEMM-based implementation, and an end-to-end latency improvement of over 40% compared to prior LLM optimization works.
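The figures quoted above come in two forms ("Nx lower" and a percentage improvement), which are easier to compare on one scale. Below is a minimal sketch of the conversion. It assumes that "Nx lower latency" means the baseline latency divided by N, and that a P% improvement means P% of the baseline latency is removed. The source defines neither convention, and the function names are illustrative.

```python
def fraction_saved(times_lower):
    """Fraction of baseline latency removed when latency is `times_lower` times lower.

    Assumption: new_latency = baseline_latency / times_lower.
    """
    return 1.0 - 1.0 / times_lower


def times_lower_from_saving(saving):
    """Inverse conversion: a fractional saving expressed as an 'Nx lower' factor.

    Assumption: new_latency = baseline_latency * (1 - saving).
    """
    return 1.0 / (1.0 - saving)


print(round(fraction_saved(1.5), 3))           # decode:  0.333
print(round(fraction_saved(2.5), 3))           # prefill: 0.6
print(round(times_lower_from_saving(0.4), 3))  # 40% saving: 1.667
```

Under these assumptions, 1.5x lower decode latency removes about a third of the baseline latency, 2.5x lower prefill latency removes 60%, and a 40% improvement corresponds to roughly 1.67x. The first two figures are measured against a GEMM-based implementation and the third against prior optimization works, so they are not directly comparable with one another.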

The computational and memory challenges of large language models (LLMs) have sparked several optimization approaches towards their efficient implementation. While prior work on LLM-targeted quantization and sparse acceleration has significantly mitigated the memory and computation bottlenecks, it assumes high-power platforms such as GPUs and server-class FPGAs with large off-chip memory bandwidths, and it employs a generalized matrix multiplication (GEMM) execution of all the layers in the decoder. In such a GEMM-based execution, data is fetched from off-chip memory, computed on, and stored back. However, at reduced off-chip memory capacities, as is the case with low-power edge devices, this implementation strategy significantly increases the attention computation latency owing to the repeated storage and fetch of large intermediate tokens to and from the off-chip memory. Moreover, fetching the weight matrices from a bandwidth-constrained memory further aggravates the memory bottleneck. To this end, we introduce MEADOW, a framework that significantly reduces the off-chip memory access for LLMs with a novel token-parallel head-sequential (TPHS) dataflow. Additionally, MEADOW applies weight packing, which performs a lossless decomposition of large weight matrices into their unique elements, thereby reducing the enormous weight fetch latency. MEADOW demonstrates 1.5x and 2.5x lower decode and prefill latency, respectively, compared to a GEMM-based LLM implementation on the low-power Xilinx ZCU102 FPGA platform, which consumes less than 10W. Additionally, MEADOW achieves an end-to-end latency improvement of over 40% compared to prior LLM optimization works.
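The weight packing idea in the abstract (decomposing a weight matrix into its unique elements without loss) can be illustrated with a generic dictionary-style encoding. This is a minimal sketch and not the paper's actual packing scheme, which the abstract does not spell out. The function names, the 8-bit value width, and the fixed-width index layout are all assumptions made for illustration.

```python
def pack_weights(weights):
    """Split quantized weights into a table of unique values plus per-weight indices."""
    unique = sorted(set(weights))
    position = {value: i for i, value in enumerate(unique)}
    indices = [position[w] for w in weights]
    return unique, indices


def unpack_weights(unique, indices):
    """Rebuild the original weights exactly from the table and the indices."""
    return [unique[i] for i in indices]


def packed_size_bits(unique, indices, value_bits=8):
    """Bits needed for the table plus fixed-width indices (assumed storage layout)."""
    index_bits = max(1, (len(unique) - 1).bit_length())
    return len(unique) * value_bits + len(indices) * index_bits


# Hypothetical 8-bit quantized weights, flattened from a small matrix.
weights = [3, -1, 3, 0, -1, 3, 0, 0]
unique, indices = pack_weights(weights)

assert unpack_weights(unique, indices) == weights  # lossless round trip
raw_bits = len(weights) * 8                        # 8 weights * 8 bits = 64
packed_bits = packed_size_bits(unique, indices)    # 3*8 table + 8*2 indices = 40
print(unique, indices, raw_bits, packed_bits)
```

In this toy case the packed form needs 40 bits instead of 64, so fewer bits would have to be fetched from off-chip memory. The saving depends on how few distinct values the quantized matrix contains relative to its size. With many distinct values the table and wider indices can cost more than the raw weights.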

