DC · Apr 22

Amoeba: Runtime Tensor Parallel Transformation for LLM Inference Services

Haoyu Chen, Xue Li, Kun Qian, Yu Guan, Jin Zhao, Xin Wang

arXiv:2509.19729 · 9.73 citations · h-index: 3

Predicted impact: top 16% in DC · last 90 days · Originality: Incremental advance

AI Analysis

This paper addresses the problem of optimizing throughput and memory usage for LLM inference services under variable workloads. It represents an incremental improvement over existing parallelism strategies.

The paper tackles the challenge of efficiently serving requests with varying context lengths in LLM inference services. It proposes Amoeba, a runtime Tensor Parallel transformation that adaptively adjusts the degree of parallelism to match request dynamics, and reports throughput improvements of 1.75x to 6.57x over state-of-the-art solutions.

In Large Language Model (LLM) inference services, it is challenging to configure a parallelism strategy that efficiently processes requests of varying context lengths. Long-context requests require a high degree of parallelism to provide more memory for the Key-Value (KV) cache, while short-context requests prefer a low degree of parallelism, which increases concurrency and thus improves throughput. To maintain high throughput while supporting large context lengths on demand, we propose Amoeba, a runtime Tensor Parallel (TP) transformation for online LLM inference services, which adaptively adjusts the TP degree of running instances to align with the dynamics of incoming requests. Evaluations using real-world traces show that Amoeba improves throughput by 1.75x-6.57x compared to state-of-the-art solutions.
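The trade-off described in the abstract can be illustrated with a rough back-of-the-envelope calculation. The sketch below is not Amoeba's algorithm. It is a simplified memory model under stated assumptions: the model shape, the 16 GiB weight size, the 24 GiB GPU size, and the helper names (`ModelConfig`, `kv_bytes_per_token`, `max_context_tokens`, `smallest_tp_for`) are all hypothetical, and activation memory, fragmentation, and communication cost are ignored.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelConfig:
    # All values are illustrative assumptions, not taken from the paper.
    num_layers: int = 32
    num_kv_heads: int = 8
    head_dim: int = 128
    bytes_per_value: int = 2          # fp16 / bf16
    weight_bytes: int = 16 * GIB      # total size of the model weights


def kv_bytes_per_token(cfg: ModelConfig) -> int:
    # One key and one value vector (factor 2) per layer and per KV head.
    return 2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim * cfg.bytes_per_value


def max_context_tokens(cfg: ModelConfig, tp: int, gpu_bytes: int) -> int:
    # Simplification: weights and KV cache are sharded evenly over `tp` GPUs,
    # so the instance's pooled free memory is tp * gpu_bytes - weight_bytes.
    free = tp * gpu_bytes - cfg.weight_bytes
    return max(free, 0) // kv_bytes_per_token(cfg)


def smallest_tp_for(cfg: ModelConfig, context_tokens: int,
                    gpu_bytes: int, total_gpus: int = 8):
    # Try TP degrees 1, 2, 4, ... and return the first whose KV cache
    # capacity covers the request; None if even the largest does not.
    tp = 1
    while tp <= total_gpus:
        if max_context_tokens(cfg, tp, gpu_bytes) >= context_tokens:
            return tp
        tp *= 2
    return None


if __name__ == "__main__":
    cfg = ModelConfig()
    for tp in (1, 2, 4, 8):
        instances = 8 // tp  # concurrent instances on an 8-GPU node
        capacity = max_context_tokens(cfg, tp, 24 * GIB)
        print(f"TP={tp}: {instances} instance(s), ~{capacity} KV tokens each")
```

Under these assumptions, a lower TP degree yields more independent instances per node (higher concurrency), while a higher TP degree pools more memory per instance for long contexts. A static configuration has to pick one point on this curve, which is the tension a runtime TP transformation aims to resolve.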
