TOGGLE: Temporal Logic-Guided Large Language Model Compression for Edge
This work addresses the deployment of LLMs on resource-constrained edge devices by providing a novel, verifiable compression method, though it is incremental in applying formal methods to a specific domain.
The paper tackles the problem of compressing large language models for edge deployment by proposing TOGGLE, a framework that uses Signal Temporal Logic to enforce linguistic properties during compression, achieving up to 3.3x reduction in computational costs and 68.8% reduction in model size while preserving specified constraints.
Large Language Models (LLMs) deliver exceptional performance across natural language tasks but demand substantial computational resources, limiting their deployment on resource-constrained edge devices. Existing compression techniques, such as quantization and pruning, often degrade critical linguistic properties and lack formal guarantees for preserving model behavior. We propose Temporal Logic-Guided Large Language Model Compression (TOGGLE), a novel framework that leverages Signal Temporal Logic (STL) to formally specify and enforce linguistic properties during compression. TOGGLE employs an STL robustness-guided Bayesian optimization to systematically explore layer-wise quantization and pruning configurations, generating compressed models that formally satisfy specified linguistic constraints without retraining or fine-tuning. Evaluating TOGGLE on four LLM architectures (GPT-2, DeepSeek-V2 7B, LLaMA 3 8B, and Mistral 7B), we achieve up to 3.3x reduction in computational costs (FLOPs) and up to a 68.8% reduction in model size while satisfying all linguistic properties. TOGGLE represents the first integration of formal methods into LLM compression, enabling efficient, verifiable deployment of LLMs on edge hardware.