Inference economics of language models
This work addresses the cost and efficiency challenges that organizations face when deploying LLMs at scale. Its contribution is incremental, since it builds on existing optimization techniques.
The authors tackle the economic trade-off between cost per token and serial token generation speed in large-scale language model inference. They develop a theoretical model that optimizes over parallelism setups and batch sizes, and use it to compute Pareto frontiers for popular models.
We develop a theoretical model of the economic trade-off between cost per token and serial token generation speed when deploying LLMs for inference at scale. Our model takes into account arithmetic, memory bandwidth, network bandwidth, and latency constraints, and searches over parallelism setups and batch sizes to find those that maximize serial inference speed at a given cost per token. We use the model to compute Pareto frontiers of serial speed versus cost per token for popular language models.
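To make the idea of a speed-versus-cost Pareto frontier concrete, the following is a minimal sketch and not the authors' model. It assumes a simple roofline-style bound (a decoding step takes the larger of its arithmetic time and its weight-loading time, plus a fixed latency when more than one accelerator is used), approximates arithmetic as 2 FLOP per parameter per token, and ignores network bandwidth and KV-cache traffic entirely. The names Hardware, Model, step_time, evaluate, pareto_frontier, and sweep are all hypothetical, as are any numbers passed to them.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Hardware:
    flops: float            # peak FLOP/s per accelerator (hypothetical value)
    mem_bw: float           # memory bandwidth in bytes/s per accelerator
    net_latency: float      # fixed communication latency per decoding step, seconds
    price_per_hour: float   # USD per accelerator-hour


@dataclass(frozen=True)
class Model:
    params: float           # number of parameters
    bytes_per_param: float  # storage per parameter, e.g. 2 for 16-bit weights


def step_time(model, hw, n_gpus, batch):
    """Seconds to generate one token for every sequence in the batch."""
    # Arithmetic bound: assumed to be roughly 2 FLOP per parameter per token.
    compute = 2 * model.params * batch / (n_gpus * hw.flops)
    # Memory bound: every weight is read once per step, sharded over the devices.
    memory = model.params * model.bytes_per_param / (n_gpus * hw.mem_bw)
    # Latency term: only paid when the model is split across devices.
    network = hw.net_latency if n_gpus > 1 else 0.0
    return max(compute, memory) + network


def evaluate(model, hw, n_gpus, batch):
    """Return (cost per token in USD, serial speed in tokens/s per sequence)."""
    t = step_time(model, hw, n_gpus, batch)
    cost = n_gpus * hw.price_per_hour / 3600 * t / batch
    return cost, 1.0 / t


def pareto_frontier(points):
    """Keep (cost, speed, config) points that no cheaper-or-equal point matches or beats on speed."""
    frontier, best_speed = [], float("-inf")
    # Sort by cost ascending; at equal cost, consider the fastest point first.
    for cost, speed, cfg in sorted(points, key=lambda p: (p[0], -p[1])):
        if speed > best_speed:
            frontier.append((cost, speed, cfg))
            best_speed = speed
    return frontier


def sweep(model, hw, gpu_counts, batch_sizes):
    """Evaluate every (n_gpus, batch) pair and return the Pareto-optimal ones."""
    points = [
        (*evaluate(model, hw, n, b), (n, b))
        for n, b in product(gpu_counts, batch_sizes)
    ]
    return pareto_frontier(points)
```

A call such as sweep(Model(70e9, 2), Hardware(1e15, 3e12, 1e-5, 2.0), [1, 2, 4, 8], [1, 8, 64, 512]), with placeholder hardware figures, returns the configurations ordered by increasing cost per token and increasing serial speed. Even in this toy setting, larger batches lower the cost per token while more accelerators raise serial speed, which is the tension the frontier captures.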