Krony-PT: GPT2 compressed with Kronecker Products
This work addresses model compression for large language models such as GPT-2, offering a more parameter-efficient alternative, though the contribution appears incremental because it builds on existing Kronecker-based methods.
The paper compresses GPT-2 with Kronecker products, specifically targeting the feed-forward weights, and produces models with 80M to 96M parameters; an 81M variant outperforms DistilGPT2 on next-token prediction across standard datasets.
We introduce Krony-PT, a compression technique for GPT-2 based on Kronecker products. We specifically target the feed-forward weights of each transformer block, and systematically compress the feed-forward layer matrices to various degrees. We introduce a modified Van Loan decomposition to initialize the new Kronecker factors, and also propose a new pruning-based initialization technique. Our method compresses the original 124M-parameter GPT-2 to various smaller models, ranging from 80M to 96M parameters. Our 81M model variant outperforms DistilGPT2 on next-token prediction across all standard language modeling datasets, and shows competitive performance with significantly larger Kronecker-based compressions of GPT-2.
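To make the initialization step concrete, the following is a minimal sketch of the standard Van Loan-Pitsianis nearest-Kronecker-product decomposition, which approximates a weight matrix W by a single product A ⊗ B in the Frobenius norm. It is not the modified decomposition introduced in this paper, whose details are not given here. The function name `nearest_kronecker`, the factor shapes, and the toy matrix sizes are assumptions chosen for illustration only.

```python
import numpy as np

def nearest_kronecker(W, a_shape, b_shape):
    """Return A, B minimizing ||W - kron(A, B)||_F (standard Van Loan-Pitsianis).

    Hypothetical helper for illustration; not the paper's modified variant.
    """
    m1, n1 = a_shape
    m2, n2 = b_shape
    assert W.shape == (m1 * m2, n1 * n2), "factor shapes must tile W exactly"

    # Rearrange W so that each row holds one flattened (m2 x n2) block.
    # After this step, kron(A, B) corresponds to the rank-1 matrix vec(A) vec(B)^T.
    R = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)

    # The best rank-1 approximation of R comes from its leading singular triplet.
    U, S, Vt = np.linalg.svd(R, full_matrices=False)
    scale = np.sqrt(S[0])  # split the singular value evenly between the factors
    A = scale * U[:, 0].reshape(m1, n1)
    B = scale * Vt[0].reshape(m2, n2)
    return A, B

# Toy check: a matrix that is exactly a Kronecker product is recovered exactly
# (up to floating-point error and a sign shared between A and B).
rng = np.random.default_rng(0)
A0 = rng.standard_normal((4, 6))
B0 = rng.standard_normal((3, 2))
W = np.kron(A0, B0)                      # shape (12, 12)

A, B = nearest_kronecker(W, A0.shape, B0.shape)
reconstruction_error = np.linalg.norm(W - np.kron(A, B))

dense_params = W.size                    # parameters stored by the dense matrix
kron_params = A.size + B.size            # parameters stored by the two factors
```

In this toy setting the dense matrix stores 144 values while the two factors store 30, which illustrates where the parameter savings come from. For a real feed-forward weight the approximation would generally not be exact, so factors initialized this way would serve only as a starting point for further training.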