LG, Apr 22

MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference

arXiv:2604.21026

AI Analysis

For practitioners deploying LLMs on memory-constrained heterogeneous hardware, MCAP offers a lightweight, load-time way to improve throughput without modifying weights.

MCAP is a deployment-time profiling method that estimates per-layer importance to guide precision and memory-placement decisions for LLM inference on heterogeneous hardware. The accompanying system, NVE, achieves 1.5-1.8x higher decode throughput than llama.cpp Q4_0 on an NVIDIA T4 and runs models under memory budgets that were previously infeasible.

Deploying large language models to heterogeneous hardware is often constrained by memory, not compute. We introduce MCAP (Monte Carlo Activation Profiling), a load-time per-layer importance estimator that enables dynamic precision and memory placement decisions on the target device. MCAP produces a lightweight per-layer signal that drives both precision dispatch (W4A8 vs. W4A16) and residency tier (GPU, RAM, SSD), allowing a single set of weights to operate across diverse memory budgets. Our system, NVE, achieves 1.5-1.8x higher decode throughput than llama.cpp Q4_0 on NVIDIA T4 and enables models to run in memory regimes previously infeasible without modifying weights.
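
To make the idea of one per-layer signal driving two decisions concrete, here is a minimal sketch in Python. It is not the MCAP or NVE implementation, which the abstract does not specify. The function `plan_layers`, the `LayerPlan` type, the threshold rule, and the greedy fill order are all hypothetical, and the sketch assumes that more important layers should get the higher activation precision (W4A16) and the faster residency tier.

```python
from dataclasses import dataclass


@dataclass
class LayerPlan:
    index: int
    precision: str  # "W4A16" or "W4A8"
    tier: str       # "GPU", "RAM", or "SSD"


def plan_layers(importance, layer_bytes, gpu_budget, ram_budget, precision_threshold):
    """Hypothetical planner: map per-layer importance to precision and tier.

    Assumption (not from the paper): layers at or above the threshold use
    W4A16, the rest use W4A8, and layers are placed greedily into the
    fastest tier that still has room, most important first.
    """
    if len(importance) != len(layer_bytes):
        raise ValueError("importance and layer_bytes must have equal length")

    # Visit layers from most to least important.
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    remaining = {"GPU": gpu_budget, "RAM": ram_budget}
    plans = {}

    for i in order:
        precision = "W4A16" if importance[i] >= precision_threshold else "W4A8"
        tier = "SSD"  # fallback tier, treated here as unbounded
        for candidate in ("GPU", "RAM"):
            if layer_bytes[i] <= remaining[candidate]:
                remaining[candidate] -= layer_bytes[i]
                tier = candidate
                break
        plans[i] = LayerPlan(i, precision, tier)

    # Return plans in original layer order.
    return [plans[i] for i in range(len(importance))]


if __name__ == "__main__":
    # Illustrative numbers only; these are not measurements from the paper.
    for plan in plan_layers(
        importance=[0.9, 0.1, 0.5, 0.3],
        layer_bytes=[100, 100, 100, 100],
        gpu_budget=200,
        ram_budget=100,
        precision_threshold=0.4,
    ):
        print(plan)
```

Because the budgets are plain arguments, the same importance scores can be replanned for a different device by changing only `gpu_budget` and `ram_budget`, which mirrors the abstract's claim that a single set of weights can operate across diverse memory budgets.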
