CL, LG, ML | Jul 30, 2024

Accelerating Large Language Model Inference with Self-Supervised Early Exits

arXiv:2407.21082v2 | 5 citations | h-index: 2
AI Analysis

This work addresses the computational bottleneck in deploying large language models, offering a practical acceleration method that is incremental but effective for real-world applications.

The paper tackles the high inference cost of large language models by introducing a modular early exit method whose exit heads are trained in a self-supervised manner. It reports significantly lower inference cost with accuracy maintained across benchmarks and, when adapted to speculative decoding, 1.66x higher token acceptance than manually tuned LayerSkip baselines.

This paper presents a modular approach to accelerate inference in large language models (LLMs) by adding early exit heads at intermediate transformer layers. Each head is trained in a self-supervised manner to mimic the main model's predictions, allowing computation to stop early when a calibrated confidence threshold is reached. We evaluate several confidence metrics and show that entropy provides the most reliable separation between correct and incorrect predictions. Experiments on the Pythia model suite (70M to 2.8B parameters) demonstrate that our method significantly reduces inference cost while maintaining accuracy across multiple benchmarks. We further adapt this approach to speculative decoding, introducing Dynamic Self-Speculative Decoding (DSSD), which achieves 1.66x higher token acceptance than manually-tuned LayerSkip baselines with minimal hyperparameter tuning.
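The entropy-based exit rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`softmax`, `entropy`, `early_exit`), the plain-list representation of per-head logits, and the fixed threshold value are all assumptions made for illustration. A real system would evaluate the layers lazily and use a threshold calibrated on held-out data.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one head's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; low entropy means a confident head."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def early_exit(
    head_logits: Sequence[Sequence[float]], threshold: float
) -> Tuple[int, int]:
    """Return (head_index, token_id) for the first confident head.

    head_logits is a hypothetical stand-in: one logit vector per exit
    head, ordered from the shallowest layer to the final output head.
    The final head is always accepted, so a prediction is guaranteed.
    """
    if not head_logits:
        raise ValueError("head_logits must not be empty")
    last = len(head_logits) - 1
    for i, logits in enumerate(head_logits):
        probs = softmax(logits)
        if i == last or entropy(probs) <= threshold:
            token = max(range(len(probs)), key=probs.__getitem__)
            return i, token
    raise AssertionError("unreachable: the final head always exits")


# Toy example: the first head is unsure, the second is confident.
toy_heads = [
    [0.1, 0.0, 0.2],  # near-uniform distribution, high entropy
    [5.0, 0.0, 0.0],  # peaked on token 0, low entropy
    [0.0, 9.0, 0.0],  # final head
]
exit_index, token_id = early_exit(toy_heads, threshold=0.3)
```

With the assumed threshold of 0.3 nats, the toy input exits at the second head. Setting the threshold to 0 disables early exits in this example, so the final head's prediction is used instead.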
