LG, AI, CL | Feb 16, 2024

Squat: Quant Small Language Models on the Edge

Harvard
arXiv:2402.10787v2 | 20 citations | h-index: 20 | Has Code | 2025 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
AI Analysis

This work addresses efficiency and deployment challenges for SLMs on resource-constrained devices, offering a practical solution for mobile and edge computing applications.

The paper tackles the problem of efficiently deploying small language models (SLMs) on mobile and edge devices by proposing Squat, a quantization-aware training framework that uses entropy-guided distillation and sub-8-bit token adaptive quantization, achieving an on-device speedup of up to 2.37x compared to FP16 models.

A growing trend has emerged in designing high-quality Small Language Models (SLMs) with a few million parameters. This trend is driven by increasing concerns over cloud costs, privacy, and latency. Considering that full-parameter training is feasible for SLMs on mobile devices, Quantization-Aware Training (QAT) is employed to improve efficiency by reducing computational overhead and memory footprint. However, previous QAT works adopt fine-grained quantization methods to compress models with billions of parameters on GPUs; these methods are incompatible with current commodity hardware, such as mobile and edge devices, which rely on Single Instruction Multiple Data (SIMD) instructions. Thus, the generalization of these methods to SLMs on mobile devices is limited. In this paper, we propose Squat, an effective QAT framework with deployable quantization for SLMs on mobile devices. Specifically, we propose entropy-guided and distribution-aligned distillation to mitigate the distortion of attention information caused by quantization. In addition, we employ sub-8-bit token-adaptive quantization, assigning varying bit widths to different tokens based on their importance. Furthermore, we develop a SIMD-based Multi-Kernel Mixed-Precision (MKMP) multiplier to support sub-8-bit mixed-precision MAC on mobile devices. Our extensive experiments verify the substantial improvements of our method over other QAT methods across various datasets. Finally, we achieve an on-device speedup of up to 2.37x compared with the FP16 counterparts. Code: https://github.com/shawnricecake/squant
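The idea of token-adaptive quantization from the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`fake_quantize`, `assign_bits`, `token_adaptive_quantize`), the 8-bit/4-bit split, the `keep_ratio` parameter, and the use of a precomputed per-token importance score are all assumptions made for illustration. The abstract only states that bit widths vary per token according to importance.

```python
from typing import List, Sequence, Tuple


def fake_quantize(values: Sequence[float], bits: int) -> List[float]:
    """Symmetric uniform quantize-then-dequantize of one token's activations."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return [0.0 for _ in values]
    scale = max_abs / qmax                  # one scale per token (assumption)
    out = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))  # clamp to the integer grid
        out.append(q * scale)
    return out


def assign_bits(importance: Sequence[float],
                high_bits: int = 8,
                low_bits: int = 4,
                keep_ratio: float = 0.5) -> List[int]:
    """Give high_bits to the most important tokens, low_bits to the rest.

    The ranking rule and the default widths are hypothetical choices.
    """
    n = len(importance)
    n_high = max(1, int(n * keep_ratio))
    ranked = sorted(range(n), key=lambda i: importance[i], reverse=True)
    high = set(ranked[:n_high])
    return [high_bits if i in high else low_bits for i in range(n)]


def token_adaptive_quantize(tokens: Sequence[Sequence[float]],
                            importance: Sequence[float],
                            **kwargs) -> Tuple[List[List[float]], List[int]]:
    """Quantize each token's activation vector with its own bit width."""
    bits = assign_bits(importance, **kwargs)
    quantized = [fake_quantize(t, b) for t, b in zip(tokens, bits)]
    return quantized, bits


if __name__ == "__main__":
    tokens = [[0.1, -0.5, 0.9], [1.0, 0.3, -0.2],
              [0.05, 0.0, -0.05], [2.0, -1.0, 0.5]]
    importance = [0.9, 0.1, 0.2, 0.8]       # made-up scores for the demo
    quantized, bits = token_adaptive_quantize(tokens, importance)
    for t, q, b in zip(tokens, quantized, bits):
        print(b, t, [round(x, 4) for x in q])
```

With half of the tokens kept at 8 bits and half at 4 bits, the average width is 6 bits, which is one way a mix could land below 8 bits on average. In QAT the quantize-dequantize step would sit inside the forward pass with a straight-through gradient estimator; that part is omitted here.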

Code Implementations: 1 repo