LG | May 13, 2025

Constrained Edge AI Deployment: Fine-Tuning vs Distillation for LLM Compression

Jacob Sander, David Moe, Achraf Cohen, Brent Venable, Venkat Dasari, Brian Jalaian

arXiv:2505.18166v14.11 citations | h-index: 2 | MILCOM

Originality: Incremental advance

AI Analysis

This work addresses the challenge of deploying AI models in resource-constrained edge environments, but it is incremental as it focuses on comparing loss functions rather than introducing a new compression method.

The study tackled the problem of compressing large language models for edge deployment by comparing fine-tuning with cross-entropy loss versus self-distillation with KL-divergence loss after MLP-only pruning, finding that KL-based distillation matched or exceeded fine-tuning in test accuracy under identical pruning conditions.
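The two re-training objectives compared here can be sketched for a single example in plain Python. This is only an illustration, not the authors' training code: the function names, the KL direction (teacher relative to student), the `temperature` parameter, and the toy five-way logits are all assumptions made for the sketch.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy_loss(student_logits, label):
    """Cross-entropy against a ground-truth label index (needs labeled data)."""
    return -log_softmax(student_logits)[label]


def kl_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) on temperature-softened distributions.

    Only teacher logits are required; no label is used.
    The direction and the temperature are assumptions of this sketch.
    """
    t_logp = log_softmax([x / temperature for x in teacher_logits])
    s_logp = log_softmax([x / temperature for x in student_logits])
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))


# Hypothetical logits for one five-way multiple-choice question.
teacher = [2.0, 0.5, -1.0, 0.1, 0.0]   # unpruned model
student = [1.2, 0.9, -0.5, 0.0, 0.3]   # pruned model being recovered

ce = cross_entropy_loss(student, label=0)
kl = kl_distillation_loss(teacher, student)
print(f"CE loss: {ce:.4f}  KL loss: {kl:.4f}")
```

The practical difference is in the inputs: the cross-entropy term needs a label for every example, while the KL term needs only a forward pass of the unpruned teacher on the same input.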

Modern foundational models are often compressed via a combination of structured pruning and re-training to meet the strict compute, memory, and connectivity constraints of edge deployments. While state-of-the-art pruning schemes target the entire Transformer, we adopt a simple, layer-wise L2-norm pruning on only the MLP blocks as a fixed baseline. Our focus is not on achieving maximal compression, but on isolating the impact of the re-training loss function: (i) Fine-tuning with Cross-Entropy (L2PFT), which requires labeled data, versus (ii) Self-Distillation with KL-divergence (L2PSD), which leverages only teacher logits (no labels). We evaluate both pipelines on the OLMo2-7B-SFT model for CommonsenseQA, suitable for intermittent or denied connectivity scenarios typical of edge networks. Under identical pruning schedules, KL-based distillation matches or exceeds CE fine-tuning in test accuracy, demonstrating that, even with a basic MLP-only pruning, the choice of loss function materially affects compressed model recovery in resource-constrained environments.
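As a rough illustration of L2-norm structured pruning on an MLP block, the sketch below scores each hidden neuron by the L2 norm of its row in the up-projection and keeps the highest-scoring fraction. The abstract does not give the exact scoring rule, so this criterion is an assumption, as are the simplified two-matrix MLP (no gate projection), the `prune_mlp` helper, and the `keep_ratio` parameter.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def prune_mlp(w_up, w_down, keep_ratio):
    """Structured pruning of one MLP block (hypothetical helper).

    w_up:   hidden x d_model matrix, one row per hidden neuron
    w_down: d_model x hidden matrix, one column per hidden neuron
    Returns the pruned matrices and the indices of the kept neurons.
    """
    hidden = len(w_up)
    n_keep = max(1, int(round(hidden * keep_ratio)))
    scores = [l2_norm(row) for row in w_up]
    # Rank neurons by score, highest first, then restore original order.
    ranked = sorted(range(hidden), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    pruned_up = [w_up[i] for i in keep]
    pruned_down = [[row[i] for i in keep] for row in w_down]
    return pruned_up, pruned_down, keep


# Toy block: d_model = 2, hidden = 4.
w_up = [[3.0, 4.0], [0.1, 0.1], [1.0, 0.0], [0.0, 2.0]]
w_down = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

pruned_up, pruned_down, keep = prune_mlp(w_up, w_down, keep_ratio=0.5)
print("kept neurons:", keep)
```

Removing a hidden neuron means dropping its row in the up-projection and the matching column in the down-projection, so the block's input and output dimensions stay unchanged and the rest of the Transformer is untouched.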
