LG AI · Mar 18

RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference

arXiv:2603.1789140.41 citations · h-index: 1

Predicted impact: top 62% in LG · last 90 days · Originality: Highly original

AI Analysis

This work addresses efficient on-device LLM inference for resource-constrained hardware, offering a novel method with zero-shot transfer capabilities, though it is incremental in improving quantization techniques.

The paper tackles the suboptimal accuracy-efficiency trade-offs of post-training quantization for LLMs by introducing RAMP, a reinforcement learning framework that learns per-layer bit-width assignments. On Llama 2 7B, RAMP achieves 5.54 perplexity at 3.68 GB, outperforming uniform 4-bit methods by 6% in size and up to 3% in quality.

Post-training quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, yet state-of-the-art methods enforce uniform bit-widths across layers, yielding suboptimal accuracy-efficiency trade-offs. We present RAMP (Reinforcement Adaptive Mixed Precision), an off-policy Soft Actor-Critic framework that learns per-layer bit-width assignments to minimize perplexity under a global bit budget. The policy conditions on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors, enabling zero-shot transfer across model families and scales. To enable stable sub-4-bit quantization, we introduce Scale Folding, a preconditioning technique that migrates activation outliers into weights via per-channel scaling and normalization-layer compensation. A quality-prioritized reward with asymmetric penalties and budget cliffs drives rapid convergence. On Llama 2 7B, RAMP achieves 5.54 perplexity at 3.68 GB (3.65 effective bits), outperforming uniform 4-bit AWQ (5.60 at 3.90 GB) and GPTQ by 6% in size and 1% to 3% in quality. Critically, a policy trained only on Llama 2 7B generalizes zero-shot to Llama 2 13B and Mistral 7B, often surpassing target-specific training, supporting the hypothesis that quantization sensitivity is primarily architectural. The HALO pipeline exports allocations to GGUF format for kernel-free inference on CPUs, GPUs, and edge devices, retaining 99.5% of FP16 commonsense reasoning performance.
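The abstract describes Scale Folding only as per-channel scaling with normalization-layer compensation. The following is a minimal sketch of that general idea, not the paper's implementation. It assumes an RMSNorm followed by a linear layer. The helper names (`rms_norm`, `linear`, `fold_scales`), the toy values, and the choice of scales are all hypothetical. Dividing the norm gain by a per-channel scale and multiplying the matching weight rows by the same scale leaves the layer output unchanged while shrinking the outlier activation.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale x to unit root-mean-square, then apply a per-channel gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def linear(h, weight):
    """y[j] = sum_i h[i] * weight[i][j]; weight has shape (in_channels, out_channels)."""
    n_out = len(weight[0])
    return [sum(h[i] * weight[i][j] for i in range(len(h))) for j in range(n_out)]

def fold_scales(gain, weight, scales):
    """Divide the norm gain by s per channel and multiply the matching weight rows by s."""
    new_gain = [g / s for g, s in zip(gain, scales)]
    new_weight = [[w * s for w in row] for row, s in zip(weight, scales)]
    return new_gain, new_weight

# Toy data (hypothetical): input channel 2 carries an activation outlier.
x = [0.5, -1.0, 40.0, 0.25]
gain = [1.0, 1.0, 1.0, 1.0]
weight = [[0.1, -0.2], [0.3, 0.4], [0.05, -0.01], [-0.6, 0.7]]
scales = [1.0, 1.0, 8.0, 1.0]  # assumed choice; only the outlier channel is rescaled

folded_gain, folded_weight = fold_scales(gain, weight, scales)

h_before = rms_norm(x, gain)         # activations entering the linear layer
h_after = rms_norm(x, folded_gain)   # same activations after folding
y_before = linear(h_before, weight)
y_after = linear(h_after, folded_weight)

max_abs_before = max(abs(v) for v in h_before)
max_abs_after = max(abs(v) for v in h_after)
print(max_abs_before, max_abs_after)  # the outlier magnitude drops after folding
print(y_before, y_after)              # the outputs agree up to floating-point error
```

In this sketch the outlier moves from the activation into one row of the weight matrix, so the weights, rather than the activations, absorb the wider dynamic range. How the scales are chosen in RAMP is not stated in the abstract.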
