LG · May 21, 2024

ReALLM: A general framework for LLM compression and fine-tuning

arXiv:2405.13155v1 · 5 citations · h-index: 9
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of deploying LLMs in resource-constrained environments. It offers a significant improvement in compression efficiency, though the advance is incremental, building on existing quantization and fine-tuning methods.

The paper tackles the problem of compressing and fine-tuning large language models (LLMs) for memory efficiency, introducing ReALLM, a framework that achieves state-of-the-art performance on language generation tasks with budgets as low as 2 bits after fine-tuning.

We introduce ReALLM, a novel approach for compression and memory-efficient adaptation of pre-trained language models that encompasses most of the post-training quantization and fine-tuning methods for a budget of <4 bits. Pre-trained matrices are decomposed into a high-precision low-rank component and a vector-quantized latent representation (using an autoencoder). During the fine-tuning step, only the low-rank components are updated. Our results show that pre-trained matrices exhibit different patterns. ReALLM adapts the shape of the encoder (small/large embedding, high/low bit VQ, etc.) to each matrix. ReALLM proposes to represent each matrix with a small embedding on $b$ bits and a neural decoder model $\mathcal{D}_\phi$ with its weights on $b_\phi$ bits. The decompression of a matrix requires only one embedding and a single forward pass with the decoder. Our weight-only quantization algorithm yields the best results on language generation tasks (C4 and WikiText-2) for a budget of $3$ bits without any training. With a budget of $2$ bits, ReALLM achieves state-of-the-art performance after fine-tuning on a small calibration dataset.
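The decomposition described in the abstract (a high-precision low-rank component plus a vector-quantized representation of what remains) can be illustrated with a deliberately simplified sketch. This is not the authors' method: it assumes a truncated SVD for the low-rank part and swaps the paper's autoencoder and neural decoder for a plain k-means codebook applied to the residual. The function names `low_rank_plus_vq` and `reconstruct`, and all parameter values, are hypothetical choices made for illustration only.

```python
import numpy as np


def low_rank_plus_vq(W, rank=4, group=2, codebook_size=16, iters=10, seed=0):
    """Split W into a low-rank part L1 @ L2 and a vector-quantized residual.

    Assumptions (not from the paper): truncated SVD for the low-rank part,
    k-means vector quantization instead of an autoencoder with a decoder.
    W.size must be divisible by `group`.
    """
    rng = np.random.default_rng(seed)

    # High-precision low-rank component from a truncated SVD.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]   # shape (rows, rank)
    L2 = Vt[:rank]                # shape (rank, cols)
    residual = W - L1 @ L2

    # Group consecutive residual entries into short vectors to quantize.
    vectors = residual.reshape(-1, group)

    # Initialise the codebook from randomly chosen residual vectors.
    init = rng.choice(len(vectors), codebook_size, replace=False)
    codebook = vectors[init].copy()

    # Plain k-means (Lloyd) iterations.
    for _ in range(iters):
        dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        codes = dist.argmin(1)
        for k in range(codebook_size):
            members = vectors[codes == k]
            if len(members):
                codebook[k] = members.mean(0)

    # Final assignment against the updated codebook.
    dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = dist.argmin(1)
    return L1, L2, codes, codebook


def reconstruct(L1, L2, codes, codebook, shape):
    """Rebuild an approximation of W from the stored components."""
    return L1 @ L2 + codebook[codes].reshape(shape)


if __name__ == "__main__":
    W = np.random.default_rng(1).standard_normal((32, 32))
    L1, L2, codes, codebook = low_rank_plus_vq(W)
    W_hat = reconstruct(L1, L2, codes, codebook, W.shape)
    print("low-rank only error:", np.linalg.norm(W - L1 @ L2))
    print("low-rank + VQ error:", np.linalg.norm(W - W_hat))
```

With these illustrative settings, each index selects one of 16 codewords covering 2 weights, so the indices cost $\log_2(16)/2 = 2$ bits per weight, not counting the codebook or the low-rank factors. In a fine-tuning setup that mirrors the abstract, `codes` and `codebook` would stay frozen and only `L1` and `L2` would be updated.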

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
