Accelerating Language Model Workflows with Prompt Choreography
This work addresses latency and computational redundancy in LLM-based multi-agent workflows, offering incremental improvements through caching and fine-tuning.
The paper tackles inefficient multi-agent workflows built on large language models by introducing Prompt Choreography, a framework that maintains a dynamic global KV cache to avoid redundant computation. It reports 2.0–6.2× faster time-to-first-token per message and >2.2× end-to-end speedups in some workflows dominated by redundant computation.
Large language models are increasingly deployed in multi-agent workflows. We introduce Prompt Choreography, a framework that efficiently executes LLM workflows by maintaining a dynamic, global KV cache. Each LLM call can attend to an arbitrary, reordered subset of previously encoded messages. Parallel calls are supported. Though caching messages' encodings sometimes gives different results from re-encoding them in a new context, we show in diverse settings that fine-tuning the LLM to work with the cache can help it mimic the original results. Prompt Choreography significantly reduces per-message latency (2.0--6.2$\times$ faster time-to-first-token) and achieves substantial end-to-end speedups ($>$2.2$\times$) in some workflows dominated by redundant computation.
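To make the caching idea concrete, here is a minimal toy sketch, not the paper's implementation. It assumes a hypothetical `GlobalCache` class and a stand-in `toy_encode` function that replaces the LLM forward pass with one fake entry per whitespace-separated token. It illustrates only the bookkeeping: each message is encoded once, and later calls assemble their context from an arbitrary, reordered subset of cached messages. It does not model attention or positional handling, nor the mismatch between cached and re-encoded messages that the fine-tuning is meant to address.

```python
from dataclasses import dataclass, field


def toy_encode(text):
    """Stand-in for an LLM forward pass: one fake (token, length) entry per token."""
    return [(tok, len(tok)) for tok in text.split()]


@dataclass
class GlobalCache:
    """Hypothetical global cache: each message is encoded once and stored by id."""
    entries: dict = field(default_factory=dict)
    tokens_encoded: int = 0  # total tokens that went through toy_encode

    def add(self, msg_id, text):
        if msg_id not in self.entries:  # encode only the first time a message is seen
            kv = toy_encode(text)
            self.entries[msg_id] = kv
            self.tokens_encoded += len(kv)

    def gather(self, msg_ids):
        """Assemble the context for one call from any subset, in the order given."""
        context = []
        for msg_id in msg_ids:
            context.extend(self.entries[msg_id])
        return context


# Illustrative messages (made up for this sketch).
messages = {
    "system": "You are a careful reviewer",
    "draft": "Here is the draft answer",
    "critique": "The draft misses one case",
}

cache = GlobalCache()
for msg_id, text in messages.items():
    cache.add(msg_id, text)

# Two calls attend to different, reordered subsets of the same cached messages.
calls = [["system", "draft"], ["critique", "system", "draft"]]
call_a = cache.gather(calls[0])
call_b = cache.gather(calls[1])

# Baseline for comparison: re-encode every message in every call.
baseline_tokens = sum(len(toy_encode(messages[m])) for call in calls for m in call)

print("tokens encoded with cache:", cache.tokens_encoded)
print("tokens encoded without cache:", baseline_tokens)
```

In this toy setting the cached path encodes each message once, while the baseline re-encodes shared messages for every call that uses them. The gap widens as more calls reuse the same messages, which is the redundancy the abstract describes.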