LG · May 27

LLM Zeroth-Order Fine-Tuning is an Inference Workload

arXiv:2605.28760 · 14.4
Predicted impact: top 30% in LG · last 90 days
Originality: Incremental advance
AI Analysis

For practitioners fine-tuning large language models with zeroth-order methods, this work provides a practical system-level optimization that significantly reduces runtime without sacrificing accuracy.

The paper argues that zeroth-order fine-tuning of LLMs is dominated by inference-style scoring rather than training. By executing the scoring phase through a serving runtime (vLLM), the authors report an 8.13x speedup on OPT-13B SST-2 (0.51 vs 4.15 estimated training hours) with comparable accuracy, and 2.34x–7.72x core-step speedups across OPT-1.3B to OPT-13B.

Zeroth-order (ZO) fine-tuning is attractive for large language models because it replaces backpropagation with forward objective evaluations. Existing implementations nevertheless execute ZO algorithms inside conventional training loops, even though their dominant work is repeated scoring under nearby parameter states. This creates a workload-runtime mismatch: the algorithm asks for structured inference-style scoring, while the system exposes a sequence of fragmented training-loop steps. We show that LLM ZO fine-tuning is an inference-dominated workload and execute its repeated scoring phase through a serving runtime. On OPT-13B SST-2, the resulting vLLM execution path completes the 20k-step LoZO run in 0.51 estimated training hours versus 4.15 hours for the official LoZO baseline under the matched LoRA-only setting, an 8.13x speedup, while reaching 0.922 final evaluation accuracy and 0.931 final full-validation accuracy. In core-step scaling experiments across OPT-1.3B to OPT-13B, the same runtime reorganization gives 2.34x–7.72x speedups. A MeZO-style high-rank factorized experiment shows that the same runtime paradigm can track a MeZO-like loss trajectory while running up to 2.55x faster. More broadly, representing ZO updates as dynamic adapter states suggests a practical path toward inference-time training, where lightweight adaptation can be scheduled as an inference-like workload rather than as a separate training job.
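
To make the "forward evaluations only" point concrete, here is a minimal sketch of a generic two-point (SPSA-style) zeroth-order update on a toy linear model. It is not the paper's LoZO or vLLM execution path. The objective, data, hyperparameters, and the names `loss`, `zo_step`, and `run` are all assumptions chosen for illustration. The thing to notice is that each step consists of two scoring calls under nearby parameter states and no backward pass.

```python
import random


def loss(params, data):
    # Toy objective (assumption): mean squared error of a linear model y = w . x.
    total = 0.0
    for x, y in data:
        pred = sum(w * xi for w, xi in zip(params, x))
        total += (pred - y) ** 2
    return total / len(data)


def zo_step(params, data, lr, eps, seed):
    # Draw a random direction z; the seed makes the perturbation reproducible.
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in params]

    # Two forward-only scoring calls at nearby parameter states.
    plus = [w + eps * zi for w, zi in zip(params, z)]
    minus = [w - eps * zi for w, zi in zip(params, z)]
    g = (loss(plus, data) - loss(minus, data)) / (2.0 * eps)

    # Move along z, scaled by the estimated directional derivative g.
    return [w - lr * g * zi for w, zi in zip(params, z)]


def run(steps=2000, lr=0.05, eps=1e-3):
    # Hypothetical data whose exact solution is w = (2, -1).
    data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
    params = [0.0, 0.0]
    start = loss(params, data)
    for t in range(steps):
        params = zo_step(params, data, lr, eps, seed=t)
    return start, loss(params, data), params
```

Calling `run()` returns the initial loss, the final loss, and the learned weights. In an LLM setting the two `loss` calls would be batched forward passes over the model, which is the part of the workload the abstract describes as inference-style scoring.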

