VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
This work challenges the paradigm of general-purpose LLM serving stacks by proposing generation-time specialization, which could benefit system builders and researchers dealing with diverse model architectures, workloads, and hardware.
VibeServe introduces an agentic loop that automatically synthesizes bespoke LLM serving systems for different usage scenarios, outperforming generic systems like vLLM in non-standard scenarios by exploiting opportunities that generic systems miss.
For years, we have built LLM serving systems like any other critical infrastructure: a single general-purpose stack, hand-tuned over many engineer-years, meant to support every model and workload. In this paper, we take the opposite bet: a multi-agent loop that automatically synthesizes bespoke serving systems for different usage scenarios. We propose VibeServe, the first agentic loop that generates entire LLM serving stacks end-to-end. VibeServe uses an outer loop to plan and track the search over system designs, and an inner loop to implement candidates, check correctness, and measure performance on the target benchmark. In the standard deployment setting, where existing stacks are highly optimized, VibeServe remains competitive with vLLM, showing that generation-time specialization need not come at the cost of performance. More interestingly, across six non-standard scenarios involving unusual model architectures, workload knowledge, and hardware-specific optimizations, VibeServe outperforms existing systems by exploiting opportunities that generic systems miss. Together, these results suggest a different point in the design space for infrastructure software: generation-time specialization rather than runtime generality. Code is available at https://github.com/uw-syfi/vibe-serve.
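
To make the outer/inner loop structure concrete, the following is a minimal sketch of such a search, not the VibeServe implementation. All names here (`Candidate`, `SearchState`, `inner_loop`, `outer_loop`) are hypothetical, and the sketch assumes that a candidate system can be modeled as a function from prompt to output, that correctness means matching a reference implementation exactly, and that a caller-supplied `measure` function stands in for the target benchmark. In the actual system, agents would generate and revise the candidates; here they are simply passed in as a fixed list.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

ServeFn = Callable[[str], str]  # hypothetical: a "serving system" maps prompt -> output


@dataclass
class Candidate:
    name: str
    build: Callable[[], ServeFn]  # stands in for an agent implementing the design


@dataclass
class SearchState:
    best_name: Optional[str] = None
    best_score: float = float("-inf")
    history: List[Tuple[str, Optional[float]]] = field(default_factory=list)


def inner_loop(candidate: Candidate, reference: ServeFn, prompts: List[str],
               measure: Callable[[ServeFn, List[str]], float]) -> Optional[float]:
    """Implement one candidate, check correctness, then measure performance."""
    serve = candidate.build()
    # Correctness gate: reject any candidate whose output differs from the reference.
    if any(serve(p) != reference(p) for p in prompts):
        return None
    return measure(serve, prompts)


def outer_loop(candidates: List[Candidate], reference: ServeFn, prompts: List[str],
               measure: Callable[[ServeFn, List[str]], float]) -> SearchState:
    """Walk the plan of candidate designs and track the best correct one."""
    state = SearchState()
    for cand in candidates:
        score = inner_loop(cand, reference, prompts, measure)
        state.history.append((cand.name, score))
        if score is not None and score > state.best_score:
            state.best_name, state.best_score = cand.name, score
    return state
```

The design choice worth noting in this sketch is that the correctness check runs before any measurement, so an incorrect candidate can never be selected regardless of how fast it is; the `history` list plays the role of the outer loop's record of what has been tried.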