LG NA Jan 29

LAMP: Look-Ahead Mixed-Precision Inference of Large Language Models

Stanislav Budzinskiy, Marian Gloser, Tolunay Yilmaz, Ying Hong Tham, Yuanyi Lin, Wenyi Fang, Fan Wu, Philipp Petersen

arXiv:2601.21623v11.4 h-index: 5

Originality Highly original

AI Analysis

This work addresses efficient deployment of large language models, offering a domain-specific incremental improvement for transformer inference.

The paper tackles efficient transformer inference by proposing an adaptive mixed-precision strategy that recomputes a small subset of components at higher accuracy. On GPT-2 models, low recomputation rates yield accuracy improvements of up to two orders of magnitude.

Mixed-precision computations are a hallmark of the current stage of AI, driving progress in large language models towards efficient, locally deployable solutions. This article addresses the floating-point computation of compositionally rich functions, concentrating on transformer inference. Based on the rounding error analysis of a composition $f(g(\mathrm{x}))$, we provide an adaptive strategy that selects a small subset of components of $g(\mathrm{x})$ to be computed more accurately while all other computations can be carried out with lower accuracy. We then explain how this strategy can be applied to different compositions within a transformer and illustrate its overall effect on transformer inference. We study the effectiveness of this algorithm numerically on GPT-2 models and demonstrate that even very low recomputation rates allow for improvements of up to two orders of magnitude in accuracy.
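The general idea of the abstract can be illustrated with a minimal sketch, under stated assumptions. Here $f$ is taken to be a softmax and $g(\mathrm{x}) = W\mathrm{x}$, the low precision is float16, and the higher precision is float64. The selection rule (recompute the `k` largest low-precision components) and the names `lookahead_mixed_precision`, `W`, and `k` are hypothetical choices made for illustration; the abstract does not state the paper's actual selection criterion, which is derived from its rounding error analysis.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax of a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def lookahead_mixed_precision(W, x, k):
    """Approximate softmax(W @ x) with g(x) = W @ x mostly in float16.

    Assumption (not taken from the paper): the k >= 1 components of g(x)
    that are largest in the low-precision pass are recomputed in float64.
    Returns the resulting probabilities and the recomputed indices.
    """
    # Low-precision pass over every component of g(x).
    g_low = (W.astype(np.float16) @ x.astype(np.float16)).astype(np.float64)

    # "Look ahead" to f: the largest inputs dominate a softmax, so they are
    # the components selected for recomputation in this sketch.
    idx = np.argpartition(g_low, -k)[-k:]

    # Recompute only the selected components in higher precision.
    g_mixed = g_low.copy()
    g_mixed[idx] = W[idx] @ x
    return softmax(g_mixed), idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((512, 64))
    x = rng.standard_normal(64)

    reference = softmax(W @ x)  # all components in float64
    low_only = softmax(
        (W.astype(np.float16) @ x.astype(np.float16)).astype(np.float64)
    )
    mixed, idx = lookahead_mixed_precision(W, x, k=16)

    print("recomputation rate:", len(idx) / W.shape[0])
    print("max abs error, float16 only:", np.max(np.abs(low_only - reference)))
    print("max abs error, mixed:", np.max(np.abs(mixed - reference)))
```

The recomputation rate in this sketch is `k` divided by the number of components of $g(\mathrm{x})$. How much the error drops for a given rate depends on the data and on the selection rule, so the printed numbers should not be read as a reproduction of the paper's GPT-2 results.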
