LG · Feb 20, 2024

Transformer tricks: Precomputing the first layer

arXiv:2402.13388v3 · 5 citations · h-index: 2 · Has Code
Originality: Synthesis-oriented
AI Analysis

This is an incremental inference optimization for transformer models that use RoPE, such as LLaMA, Mistral, PaLM, and Gemma.

The paper reduces transformer inference latency and cost-per-token by precomputing a large portion of the first layer in models that use RoPE. Because only one layer is optimized, the gain is small and shrinks as depth grows: the maximum savings are 25% for a 4-layer model and 3% for a 32-layer model.
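The quoted figures are consistent with a simple bound: if every layer costs roughly the same and the first layer could be skipped entirely, the saving is at most 1/N for an N-layer model. The sketch below works through that arithmetic. The function name `max_first_layer_savings` and the equal-cost-per-layer assumption are illustrative and are not taken from the paper's code.

```python
def max_first_layer_savings(num_layers: int) -> float:
    """Upper bound on the fraction of layer compute saved.

    Assumption (not from the paper's code): all layers cost the same and
    the first layer becomes completely free, so the bound is 1 / num_layers.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    return 1.0 / num_layers


for n in (4, 32):
    print(f"{n:>2} layers: at most {max_first_layer_savings(n):.1%} saved")
```

Only part of the first layer is precomputed and the bound ignores costs outside the layers, so real savings would sit below these values.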

This micro-paper describes a trick to speed up inference of transformers with RoPE (such as LLaMA, Mistral, PaLM, and Gemma). For these models, a large portion of the first transformer layer can be precomputed, which results in slightly lower latency and lower cost-per-token. Because this trick optimizes only one layer, the relative savings depend on the total number of layers. For example, the maximum savings for a model with only 4 layers (such as Whisper tiny) is limited to 25%, while a 32-layer model is limited to 3% savings. See https://github.com/OpenMachine-ai/transformer-tricks for code and more transformer tricks.
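A toy illustration may help show why precomputation is possible at all. It is a sketch under stated assumptions, not the repository's implementation. It assumes the first layer's input is a plain embedding lookup with no positional information added, as is typical for RoPE models. Under that assumption the normalized Q, K, and V projections of the first layer depend only on the token id, so they can be tabulated once per vocabulary entry. All names, sizes, and the choice of RMS normalization are hypothetical.

```python
import math
import random


def rms_norm(x, eps=1e-6):
    """RMS-normalize a vector (no learned gain, for brevity)."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]


def matvec(x, w):
    """Multiply a row vector x (d_in) by a matrix w (d_in rows, d_out columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def random_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
VOCAB, D = 8, 4  # hypothetical toy sizes
embedding = random_matrix(VOCAB, D)
w_q, w_k, w_v = (random_matrix(D, D) for _ in range(3))


def first_layer_qkv(token_id):
    """Baseline: embedding lookup, normalization, then three projections."""
    h = rms_norm(embedding[token_id])
    return matvec(h, w_q), matvec(h, w_k), matvec(h, w_v)


# Offline step: tabulate Q, K, V once for every token in the vocabulary.
qkv_table = [first_layer_qkv(t) for t in range(VOCAB)]


def first_layer_qkv_precomputed(token_id):
    """Inference: a table lookup replaces the norm and the three matmuls."""
    return qkv_table[token_id]


# The lookup reproduces the baseline for any token sequence.
for t in [3, 1, 3, 7]:
    assert first_layer_qkv_precomputed(t) == first_layer_qkv(t)
```

In this sketch the rotary position encoding would still be applied to the looked-up Q and K at run time, since it depends on the token's position rather than its id. The table trades compute for memory: it stores three vectors per vocabulary entry.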

Code Implementations: 1 repo
