Transformer tricks: Precomputing the first layer
This is an incremental optimization for users of transformer models like LLaMA, Mistral, PaLM, and Gemma to reduce inference costs.
The paper reduces transformer inference latency and cost by precomputing a large portion of the first layer in models with RoPE. The result is slightly lower latency and cost-per-token, with maximum savings of about 3% for a 32-layer model and 25% for a 4-layer model.
This micro-paper describes a trick to speed up inference of transformers with RoPE (such as LLaMA, Mistral, PaLM, and Gemma). For these models, a large portion of the first transformer layer can be precomputed, which results in slightly lower latency and lower cost-per-token. Because this trick optimizes only one layer, the relative savings depend on the total number of layers: with N layers, at most 1/N of the per-layer work can be removed. For example, the maximum savings for a model with only 4 layers (such as Whisper tiny) are limited to 25%, while those for a 32-layer model are limited to about 3%. See https://github.com/OpenMachine-ai/transformer-tricks for code and more transformer tricks.
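The idea can be illustrated with a minimal sketch in plain Python. It is not the code from the repository above, and it rests on an assumption the abstract does not spell out: with RoPE, position information is applied after the query/key projections, so the first layer's normalized input and its Q, K, V projections depend only on the token ID and can be stored in a lookup table ahead of time. All names here (`precompute_qkv`, `max_savings`, the toy sizes, and the unscaled RMSNorm) are hypothetical and chosen only for illustration.

```python
import random

def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned scale (a stand-in for the first-layer norm)."""
    rms = (sum(v * v for v in x) / len(x) + eps) ** 0.5
    return [v / rms for v in x]

def first_layer_qkv(e, wq, wk, wv):
    """Pre-RoPE Q, K, V for one token embedding, computed on the fly."""
    h = rms_norm(e)
    return matvec(wq, h), matvec(wk, h), matvec(wv, h)

def precompute_qkv(embedding, wq, wk, wv):
    """Offline step: one (Q, K, V) entry per vocabulary token."""
    return [first_layer_qkv(e, wq, wk, wv) for e in embedding]

def max_savings(num_layers):
    """Upper bound on relative savings when only one layer is optimized."""
    return 1.0 / num_layers

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (assumed): vocabulary of 5 tokens, model dimension 4.
rng = random.Random(0)
vocab_size, d_model = 5, 4
embedding = rand_matrix(vocab_size, d_model, rng)
wq = rand_matrix(d_model, d_model, rng)
wk = rand_matrix(d_model, d_model, rng)
wv = rand_matrix(d_model, d_model, rng)

table = precompute_qkv(embedding, wq, wk, wv)

# At inference time, a table lookup replaces the norm and three projections.
token_id = 3
q, k, v = table[token_id]
matches_on_the_fly = (q, k, v) == first_layer_qkv(embedding[token_id], wq, wk, wv)

savings_4_layers = max_savings(4)    # 0.25
savings_32_layers = max_savings(32)  # 0.03125
```

In this sketch RoPE would still be applied to the looked-up Q and K at run time, since it depends on the token position. The table stores three vectors per vocabulary entry instead of one embedding, so the lookup trades memory for compute.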