LG, AI, CL | Jun 27, 2025

Projected Compression: Trainable Projection for Efficient Transformer Compression

arXiv:2506.22255v1 | h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient model compression to reduce inference time and computational cost. The method is an incremental advance over existing compression techniques.

The paper tackles the problem of reducing the size and computational demands of large language models. It introduces Projected Compression, a technique that trains additional projection modules while keeping access to all original model parameters, then merges the projections into a lower-dimensional product matrix to produce a reduced-size standard Transformer model. The method's per-token computation step matches the base model's in FLOPs, and it outperforms a comparable hard pruning and retraining approach on higher-quality models, with a margin that scales well with the number of tokens.

Large language models have steadily increased in size to achieve improved performance; however, this growth has also led to greater inference time and computational demands. Consequently, there is rising interest in model size reduction methods. To address this issue, we propose Projected Compression, a novel model compression technique that reduces model weights by utilizing projection modules. Specifically, we first train additional projection weights while preserving access to all the original model parameters. Subsequently, these projections are merged into a lower-dimensional product matrix, resulting in a reduced-size standard Transformer-based model. Unlike alternative approaches that require additional computational overhead, our method matches the base model's per-token computation step in FLOPs. Experimental results show that Projected Compression outperforms the comparable hard pruning and retraining approach on higher-quality models. Moreover, the performance margin scales well with the number of tokens.
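The merge step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the two-sided form `proj_out @ weight @ proj_in`, the helper names `selection_matrix` and `merge_projections`, the toy dimensions, and the choice to initialize the projections as row and column selectors are all assumptions made for illustration. With that initialization the merged matrix equals a hard-pruned submatrix, which gives a natural starting point before any projection training takes place.

```python
import numpy as np

def selection_matrix(keep, size):
    """Return a (len(keep), size) 0/1 matrix that picks the indices in `keep`."""
    sel = np.zeros((len(keep), size))
    sel[np.arange(len(keep)), keep] = 1.0
    return sel

def merge_projections(weight, proj_out, proj_in):
    """Collapse the original weight and both projections into one smaller matrix."""
    return proj_out @ weight @ proj_in

rng = np.random.default_rng(0)
d_out, d_in = 8, 6                      # original layer shape (hypothetical)
weight = rng.standard_normal((d_out, d_in))  # original weight, left untouched

keep_rows = [0, 2, 5, 7]                # hypothetical output units to keep
keep_cols = [1, 3, 4]                   # hypothetical input units to keep

# Assumption: projections start as selectors, so the merge begins at hard pruning.
proj_out = selection_matrix(keep_rows, d_out)    # shape (4, 8)
proj_in = selection_matrix(keep_cols, d_in).T    # shape (6, 3)

# In training, only proj_out and proj_in would receive gradient updates;
# after training, the product is stored as an ordinary dense weight.
compressed = merge_projections(weight, proj_out, proj_in)
print(compressed.shape)
```

Because the merged result is a plain dense matrix of the reduced shape, the compressed layer needs no projection modules at inference time, which is consistent with the abstract's description of a "reduced-size standard Transformer-based model".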

