CL · AI · LG · Nov 4, 2025

Optimal Singular Damage: Efficient LLM Inference in Low Storage Regimes

arXiv:2511.02681v1
Originality: Incremental advance
AI Analysis

This paper addresses the storage challenges of deploying fine-tuned LLMs across diverse applications, but the contribution is incremental because it builds on the known low-rank and sparse properties of fine-tuning updates.

The paper tackles the problem of efficiently storing fine-tuned large language models. It proposes optimal singular damage, a method that selectively sparsifies low-rank approximated updates so that the most impactful components are retained. Within the same memory budget, the method achieves significant storage efficiency and higher accuracy than low-rank approximation or sparsification used individually.

Large language models (LLMs) are increasingly prevalent across diverse applications. However, their enormous size restricts the ability to store and process them to a few well-resourced stakeholders. As a result, most applications rely on pre-trained LLMs that are fine-tuned for specific tasks. Yet even storing the fine-tuned versions of these models remains a significant challenge because of the wide range of tasks they address. Recent studies show that fine-tuning primarily affects a small fraction of parameters, which motivates more efficient storage of fine-tuned models. This paper focuses on efficient storage of parameter updates in pre-trained models after fine-tuning. To address this challenge, we leverage the observation that fine-tuning updates are both low-rank and sparse, properties that can be exploited for storage efficiency. However, using low-rank approximation or sparsification alone may discard critical singular components that enhance model expressivity. We first observe that, given the same memory budget, sparsified low-rank approximations with larger ranks outperform standard low-rank approximations with smaller ranks. Building on this, we propose optimal singular damage, a method that selectively sparsifies low-rank approximated updates by leveraging the interleaved importance of singular vectors, ensuring that the most impactful components are retained. We demonstrate through extensive experiments that our proposed methods achieve significant storage efficiency and superior accuracy within the same memory budget compared to employing low-rank approximation or sparsification individually.
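The equal-budget comparison described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's algorithm: the update matrix is synthetic, the function names (`low_rank_factors`, `keep_largest`, `sparsified_low_rank`) are hypothetical, the memory budget counts stored values only and ignores the cost of storing sparse indices, and plain magnitude pruning stands in for the importance criterion that optimal singular damage uses, which the abstract does not specify.

```python
import numpy as np


def low_rank_factors(delta, rank):
    """Truncated SVD: return A (m x rank), B (rank x n) with A @ B ~= delta."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    root = np.sqrt(s[:rank])          # split singular values across both factors
    a = u[:, :rank] * root            # scales column i of U by sqrt(s_i)
    b = root[:, None] * vt[:rank, :]  # scales row i of V^T by sqrt(s_i)
    return a, b


def keep_largest(x, k):
    """Zero all but the k largest-magnitude entries of x (magnitude pruning)."""
    if k <= 0:
        return np.zeros_like(x)
    if k >= x.size:
        return x.copy()
    flat = x.ravel()
    keep = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the k largest
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(x.shape)


def sparsified_low_rank(delta, rank, budget):
    """Larger-rank factors pruned so that at most `budget` values are stored.

    Assumption: the budget is split between the factors in proportion to
    their sizes, and index storage is not counted.
    """
    m, n = delta.shape
    a, b = low_rank_factors(delta, rank)
    k_a = budget * m // (m + n)
    k_b = budget - k_a
    return keep_largest(a, k_a), keep_largest(b, k_b)


# Synthetic stand-in for a fine-tuning update (hypothetical data).
rng = np.random.default_rng(0)
m, n = 64, 48
delta = rng.standard_normal((m, 8)) @ rng.standard_normal((8, n))

budget = 4 * (m + n)  # values needed to store a dense rank-4 factorization
a_lr, b_lr = low_rank_factors(delta, 4)             # smaller rank, dense
a_sp, b_sp = sparsified_low_rank(delta, 8, budget)  # larger rank, sparse

for name, a, b in [("rank-4 dense", a_lr, b_lr), ("rank-8 sparse", a_sp, b_sp)]:
    stored = np.count_nonzero(a) + np.count_nonzero(b)
    err = np.linalg.norm(delta - a @ b) / np.linalg.norm(delta)
    print(f"{name}: stored values = {stored}, relative error = {err:.3f}")
```

Both variants store at most the same number of values, so the printed relative errors are comparable under this simplified accounting. Which variant comes out ahead depends on the data and on the pruning criterion; the paper's claim concerns real fine-tuning updates and its own selection rule, so this toy run should not be read as reproducing its results.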

