LG, AI · Jun 2

Recover-LoRA for Aggressive Quantization: Reclaiming Accuracy in 2-Bit Language Models via Low-Rank Adaptation with Knowledge Distillation on Synthetic Data

Devleena Das, Rajeev Patwari, Elliott Delaye, Ashish Sirasao

arXiv:2606.04238

AI Analysis

For practitioners deploying LLMs on memory-constrained devices, this work provides a practical post-quantization tool to recover accuracy lost from aggressive 2-bit quantization without requiring labeled data.

Recover-LoRA, a data-free accuracy recovery method, is extended to 2-bit quantized LLMs. A mixed-precision configuration (W4/W2-GateUp: a 4-bit base with 2-bit gate and up projections) delivers a 7.5-23.3% tokens-per-second (TPS) improvement over uniform W4 in roofline analysis. Low-rank adapters trained via logit distillation on synthetic data then achieve 80-95% accuracy recovery on 9 of 12 benchmarks for Qwen3-4B with only 10k synthetic samples.
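To make the throughput claim concrete, here is a minimal back-of-envelope sketch of a memory-bound decode estimate. It is not the paper's roofline model: it assumes each generated token reads every weight once plus a KV-cache term, and it ignores compute limits and quantization metadata such as scales. The function names and all numeric inputs (parameter count, gate/up fraction, bandwidth) are hypothetical.

```python
def decode_tps(weight_params, gate_up_fraction, base_bits, gate_up_bits,
               kv_bytes, bandwidth_bytes_per_s):
    """Tokens per second if decoding is purely memory-bandwidth bound."""
    gate_up = weight_params * gate_up_fraction   # params in gate/up projections
    other = weight_params - gate_up              # all remaining linear params
    weight_bytes = (other * base_bits + gate_up * gate_up_bits) / 8
    bytes_per_token = weight_bytes + kv_bytes    # traffic for one decode step
    return bandwidth_bytes_per_s / bytes_per_token


def gain_percent(weight_params, gate_up_fraction, kv_bytes, bandwidth_bytes_per_s):
    """Relative TPS gain of W4/W2-GateUp over uniform W4, in percent."""
    uniform = decode_tps(weight_params, gate_up_fraction, 4, 4,
                         kv_bytes, bandwidth_bytes_per_s)
    mixed = decode_tps(weight_params, gate_up_fraction, 4, 2,
                       kv_bytes, bandwidth_bytes_per_s)
    return (mixed / uniform - 1.0) * 100.0


if __name__ == "__main__":
    # Hypothetical inputs: 4e9 weights, 40% of them in gate/up, 100 GB/s.
    for kv in (0.0, 0.5e9, 2.0e9):
        print(f"KV bytes per token {kv:.1e}: gain {gain_percent(4e9, 0.4, kv, 1e11):.1f}%")
```

In this simplified model the gain is largest when weight traffic dominates and shrinks as the KV-cache term grows, which is one plausible reason a gain would depend on context length.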

Aggressive weight quantization to 2-bit precision offers substantial throughput and memory gains for large language model (LLM) inference, but typically incurs severe accuracy degradation. These gains are particularly relevant for edge and on-device deployment, where memory capacity and bandwidth are primary constraints. In this work, we extend Recover-LoRA, a lightweight, data-free accuracy recovery method originally developed for general model weight corruption, to the setting of ultra-low-bit quantization. We propose a selective mixed-precision strategy in which only the gate and up projection layers of the MLP are quantized to 2-bit (W2), while all other linear layers remain at higher precision, yielding a mixed-precision GateUp configuration. We demonstrate via roofline analysis across three model families (4B-20B) and two hardware platforms that a W4/W2-GateUp deployment (4-bit base with 2-bit gate/up) delivers a 7.5-23.3% TPS improvement over uniform W4, depending on model and context length, while confining quantization error to a predictable subset of layers. We then apply Recover-LoRA, which trains low-rank adapters on the quantized layers via logit distillation with synthetic data, to recover accuracy lost from 2-bit quantization of the gate and up layers. In a case study on Qwen3-4B, Recover-LoRA achieves 80-95% accuracy recovery on 9 of 12 benchmarks, using only 10k synthetic training samples and no labeled data. We further demonstrate that synthetic data performs comparably to curated labeled data for distillation-based recovery, and that recovery generalizes to out-of-distribution evaluation tasks. Our results present Recover-LoRA as a practical post-quantization accuracy recovery tool for aggressive weight compression in deployment settings.
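The recovery idea can be illustrated with a toy sketch: freeze a 2-bit-quantized weight matrix, add a low-rank correction B·A, and train only A and B so that the student's output distribution matches the full-precision teacher's on unlabeled inputs. This is not the authors' implementation. The single linear layer, the round-to-nearest quantizer, the random Gaussian inputs standing in for synthetic data, and every size and hyperparameter below are assumptions chosen for illustration.

```python
import math
import random

rng = random.Random(0)
D_IN, VOCAB, RANK, N_SAMPLES = 8, 6, 2, 64  # toy sizes (assumed)

def quantize_2bit(row):
    """Round-to-nearest uniform quantization of one row to 4 levels (2 bits)."""
    lo, hi = min(row), max(row)
    step = (hi - lo) / 3 or 1.0
    return [lo + round((w - lo) / step) * step for w in row]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def student_logits(Wq, A, B, x):
    """Logits of the quantized layer plus the low-rank correction B(Ax)."""
    h = matvec(A, x)
    z = [zq + zl for zq, zl in zip(matvec(Wq, x), matvec(B, h))]
    return z, h

def mean_kl(Wq, A, B, data, targets):
    """Average KL(teacher || student) over the unlabeled inputs."""
    total = 0.0
    for x, p_t in zip(data, targets):
        p_s = softmax(student_logits(Wq, A, B, x)[0])
        total += sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    return total / len(data)

def train(Wq, data, targets, steps=200, lr=0.2):
    """Full-batch gradient descent on A and B only; Wq stays frozen."""
    A = [[rng.gauss(0, 0.3) for _ in range(D_IN)] for _ in range(RANK)]
    B = [[0.0] * RANK for _ in range(VOCAB)]  # zero init: student starts at Wq
    for _ in range(steps):
        gA = [[0.0] * D_IN for _ in range(RANK)]
        gB = [[0.0] * RANK for _ in range(VOCAB)]
        for x, p_t in zip(data, targets):
            z, h = student_logits(Wq, A, B, x)
            g = [ps - pt for ps, pt in zip(softmax(z), p_t)]  # dKL/dlogits
            for i in range(VOCAB):
                for k in range(RANK):
                    gB[i][k] += g[i] * h[k]
            for k in range(RANK):
                back = sum(B[i][k] * g[i] for i in range(VOCAB))
                for j in range(D_IN):
                    gA[k][j] += back * x[j]
        scale = lr / len(data)
        for i in range(VOCAB):
            for k in range(RANK):
                B[i][k] -= scale * gB[i][k]
        for k in range(RANK):
            for j in range(D_IN):
                A[k][j] -= scale * gA[k][j]
    return A, B

W = [[rng.gauss(0, 1) for _ in range(D_IN)] for _ in range(VOCAB)]   # teacher
Wq = [quantize_2bit(row) for row in W]                               # frozen student base
data = [[rng.gauss(0, 1) for _ in range(D_IN)] for _ in range(N_SAMPLES)]
targets = [softmax(matvec(W, x)) for x in data]                      # teacher distributions

zero_A = [[0.0] * D_IN for _ in range(RANK)]
zero_B = [[0.0] * RANK for _ in range(VOCAB)]
loss_before = mean_kl(Wq, zero_A, zero_B, data, targets)
A, B = train(Wq, data, targets)
loss_after = mean_kl(Wq, A, B, data, targets)
print(f"mean KL before adapters: {loss_before:.4f}, after: {loss_after:.4f}")
```

No labels appear anywhere in the loop: the only supervision is the teacher's output distribution, which is what makes a distillation-based recovery usable with synthetic or otherwise unlabeled inputs.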
