CURing Large Models: Compression via CUR Decomposition
The work addresses the heavy resource demands that large models place on their users, but the contribution is incremental, since it builds on existing CUR decomposition techniques.
The paper tackles the memory cost of large deep learning models by introducing CURing, a compression method based on CUR matrix decomposition that selects informative rows and columns of weight matrices. It reduces Llama3.1-8B's parameter count by 9% with minimal performance loss and compresses over 20 times faster than prior methods.
Large deep learning models have achieved remarkable success but are resource-intensive, posing challenges such as memory usage. We introduce CURing, a novel model compression method based on CUR matrix decomposition, which approximates weight matrices as the product of selected columns (C) and rows (R), and a small linking matrix (U). We apply this decomposition to weights chosen based on the combined influence of their magnitudes and activations. By identifying and retaining informative rows and columns, CURing significantly reduces model size with minimal performance loss. For example, it reduces Llama3.1-8B's parameters to 7.32B (-9%) in just 129 seconds, over 20 times faster than prior compression methods.
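The decomposition described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the importance score used here (weight magnitude scaled by a per-input-feature activation norm), the function names cur_decompose and param_count, and the choice of U as the least-squares linking matrix pinv(C) @ W @ pinv(R) are all assumptions made for illustration. The sketch shows how an m x n weight matrix can be replaced by three smaller factors, and how the stored parameter count changes from m*n to m*k + k*k + k*n for k selected rows and columns.

```python
import numpy as np

def cur_decompose(W, act_norm, rank):
    """Approximate W (m x n) as C @ U @ R using `rank` columns and rows.

    act_norm is a length-n vector of input-activation norms. The scoring
    rule below is a hypothetical stand-in for the paper's selection
    criterion, which combines magnitudes and activations.
    """
    # Assumed importance score: |W_ij| scaled by the norm of input feature j.
    score = np.abs(W) * act_norm[np.newaxis, :]
    # Keep the `rank` highest-scoring columns and rows (indices sorted for stability).
    col_idx = np.sort(np.argsort(score.sum(axis=0))[-rank:])
    row_idx = np.sort(np.argsort(score.sum(axis=1))[-rank:])
    C = W[:, col_idx]  # m x rank, actual columns of W
    R = W[row_idx, :]  # rank x n, actual rows of W
    # Linking matrix that minimizes ||W - C U R||_F for the chosen C and R.
    U = np.linalg.pinv(C) @ W @ np.linalg.pinv(R)  # rank x rank
    return C, U, R

def param_count(m, n, rank):
    """Return (dense parameters, parameters stored by the C, U, R factors)."""
    return m * n, m * rank + rank * rank + rank * n

# Toy example: a 64 x 48 matrix of exact rank 8, so a rank-8 CUR
# approximation can recover it up to floating-point error.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 48))
act_norm = np.abs(rng.standard_normal(48)) + 0.1  # made-up activation norms

C, U, R = cur_decompose(W, act_norm, rank=8)
rel_err = np.linalg.norm(W - C @ U @ R) / np.linalg.norm(W)
dense, factored = param_count(64, 48, 8)
print(f"relative error: {rel_err:.2e}")
print(f"parameters: {dense} dense vs {factored} factored")
```

Because C and R are actual columns and rows of the original weights, the factors remain directly tied to the original matrix, in contrast to an SVD-style factorization. Real weight matrices are not exactly low rank, so in practice the approximation error is nonzero and the rank trades model size against accuracy.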