DC | Nov 7, 2025

UMDAM: A Unified Data Layout and DRAM Address Mapping for Heterogenous NPU-PIM

arXiv:2511.03293 | h-index: 3
AI Analysis

Addresses memory bottlenecks in edge LLM inference by enabling efficient NPU-PIM co-execution without extra memory overhead or bandwidth loss.

UMDAM proposes a unified data layout and DRAM address mapping for NPU-PIM co-execution, reducing TTFT by up to 3.0x and TTLT by 2.18x on OPT models for edge LLM inference.

Large Language Models (LLMs) are increasingly deployed on edge devices with Neural Processing Units (NPUs), yet the decode phase remains memory-intensive, limiting performance. Processing-in-Memory (PIM) offers a promising solution, but co-executing NPU-PIM systems face challenges such as data layout mismatches, bandwidth loss, and redundant storage. To address these issues, we propose UMDAM, a unified memory-affinity data layout and DRAM address mapping scheme tailored for NPU-PIM co-execution. UMDAM employs a column-major, tile-based layout and a configurable DRAM mapping strategy to ensure compatibility with NPU computation while maximizing PIM efficiency -- without introducing extra memory overhead or bandwidth loss. Comprehensive evaluations on OPT models demonstrate that UMDAM reduces time-to-first-token (TTFT) by up to 3.0x and time-to-last-token (TTLT) by 2.18x, significantly improving end-to-end LLM inference efficiency on edge devices.
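To make the two ideas in the abstract concrete, here is a minimal sketch of what a column-major, tile-based layout and a configurable DRAM address mapping could look like. It is an illustration only, not the paper's actual scheme: the tile shape, the `DramGeometry` field widths, the field order, and the function names `tiled_column_major_offset` and `map_address` are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DramGeometry:
    # Hypothetical field widths in bits; not taken from the paper.
    column_bits: int = 5
    bank_bits: int = 4
    row_bits: int = 14


def tiled_column_major_offset(row, col, n_rows, tile_rows, tile_cols):
    """Linear element offset of matrix entry (row, col).

    Tiles are ordered column-major across the matrix, and elements inside
    each tile are also stored column-major. Assumes the matrix dimensions
    are exact multiples of the tile dimensions.
    """
    tiles_per_column = n_rows // tile_rows
    tile_r, in_r = divmod(row, tile_rows)
    tile_c, in_c = divmod(col, tile_cols)
    tile_index = tile_c * tiles_per_column + tile_r
    return tile_index * (tile_rows * tile_cols) + in_c * tile_rows + in_r


def map_address(burst_index, geom, order=("column", "bank", "row")):
    """Split a linear burst index into DRAM fields.

    `order` lists the fields from least to most significant bits, so
    changing it reconfigures which field consecutive bursts step through.
    """
    widths = {
        "column": geom.column_bits,
        "bank": geom.bank_bits,
        "row": geom.row_bits,
    }
    fields = {}
    for name in order:
        fields[name] = burst_index & ((1 << widths[name]) - 1)
        burst_index >>= widths[name]
    return fields
```

With the default order, consecutive bursts walk through the columns of one DRAM row before moving to the next bank. Passing `order=("bank", "column", "row")` instead interleaves consecutive bursts across banks. Which order suits the NPU and which suits the PIM units depends on the hardware, and this sketch does not model that.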

