LG · AI · CL · PF · Apr 6, 2025

Thanos: A Block-wise Pruning Algorithm for Efficient Large Language Model Compression

arXiv:2504.05346v1 · 7 citations · h-index: 5
Originality: Incremental advance
AI Analysis

Thanos provides a practical solution for deploying large models in resource-constrained environments, though the advance is incremental because it builds on existing pruning techniques.

The paper tackles the problem of reducing memory and computational costs of large language models through a novel weight-pruning algorithm called Thanos, achieving state-of-the-art performance in structured pruning and outperforming existing methods in unstructured pruning.

This paper presents Thanos, a novel weight-pruning algorithm designed to reduce the memory footprint and enhance the computational efficiency of large language models (LLMs) by removing redundant weights while maintaining accuracy. Thanos introduces a block-wise pruning strategy with adaptive masks that dynamically adjust to weight importance, enabling flexible sparsity patterns and structured formats, such as $n:m$ sparsity, optimized for hardware acceleration. Experimental evaluations demonstrate that Thanos achieves state-of-the-art performance in structured pruning and outperforms existing methods in unstructured pruning. By providing an efficient and adaptable approach to model compression, Thanos offers a practical solution for deploying large models in resource-constrained environments.
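To make the $n:m$ sparsity format mentioned above concrete, here is a minimal NumPy sketch of building a 2:4 mask. It is not the Thanos algorithm: it uses plain weight magnitude as the importance score, and the function name `nm_magnitude_mask` is hypothetical. It also assumes that $n$ counts the weights zeroed in each group of $m$ consecutive weights; conventions differ on whether $n$ counts zeroed or kept weights, and for 2:4 the two readings coincide.

```python
import numpy as np

def nm_magnitude_mask(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Boolean keep-mask with exactly n pruned entries in every group of m
    consecutive weights along each row.

    Illustration only: importance is plain |w|, not the Thanos criterion.
    """
    rows, cols = weights.shape
    if cols % m != 0:
        raise ValueError("number of columns must be divisible by m")
    # View each row as consecutive groups of m weights.
    groups = np.abs(weights).reshape(rows, cols // m, m)
    # Indices of the n smallest-magnitude entries inside each group.
    prune_idx = np.argsort(groups, axis=-1)[..., :n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, prune_idx, False, axis=-1)
    return mask.reshape(rows, cols)

# Usage: apply a 2:4 mask to a toy weight matrix.
w = np.array([[0.1, -2.0, 0.3, 4.0, -5.0, 0.2, 0.05, 1.0]])
mask = nm_magnitude_mask(w, n=2, m=4)
pruned = w * mask  # zeroes the two smallest-magnitude weights in each group of four
```

Every group of four consecutive weights ends up with exactly two non-zeros, which is the regular pattern that hardware with 2:4 sparse support can exploit.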

Code Implementations: 1 repo
