DC · LG · OS · Jan 14

LatencyPrism: Online Non-intrusive Latency Sculpting for SLO-Guaranteed LLM Inference

arXiv:2601.09258v1 · 1 citation
Originality: Incremental advance
AI Analysis

This paper addresses latency management for production LLM services, offering a non-intrusive way to protect user experience and control operational costs. The advance is incremental because it builds on existing profiling methods.

The paper tackles latency spikes in LLM inference by introducing LatencyPrism, a zero-intrusion system that monitors latency and alerts on anomalies without code modifications or service restarts. It separates workload-driven latency variations from genuine anomalies with an F1-score of 0.98, in support of SLO adherence.

LLM inference latency critically determines user experience and operational costs, directly impacting throughput under SLO constraints. Even brief latency spikes degrade service quality despite acceptable average performance. However, distributed inference environments featuring diverse software frameworks and XPU architectures, combined with dynamic workloads, make latency analysis challenging. Constrained by intrusive designs that necessitate service restarts or even suspension, and by hardware-bound implementations that fail to adapt to heterogeneous inference environments, existing AI profiling methods are often inadequate for real-time production analysis. We present LatencyPrism, the first zero-intrusion multi-platform latency sculpting system. It aims to break down inference latency across the pipeline, proactively alert on inference latency anomalies, and guarantee adherence to SLOs, all without requiring code modifications or service restarts. LatencyPrism has been deployed across thousands of XPUs for over six months. It enables low-overhead real-time monitoring at batch level with alerts triggered in milliseconds. This approach distinguishes between workload-driven latency variations and anomalies indicating underlying issues with an F1-score of 0.98. We also conduct extensive experiments and investigations into root cause analysis to demonstrate LatencyPrism's capability.
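To make the abstract's last claim concrete, here is a minimal toy sketch of what "distinguishing workload-driven latency variations from anomalies" and scoring the result with F1 could look like. It is not LatencyPrism's method, which the abstract does not describe. The linear latency baseline, the `Batch` fields, the `tolerance_ms` threshold, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Batch:
    tokens: int        # hypothetical workload size of the batch
    latency_ms: float  # observed batch latency
    is_anomaly: bool   # hypothetical ground-truth label


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    return a, mean_y - a * mean_x


def flag_anomalies(batches, a, b, tolerance_ms):
    """Flag batches whose latency exceeds the workload-predicted
    latency (a * tokens + b) by more than tolerance_ms."""
    return [bt.latency_ms - (a * bt.tokens + b) > tolerance_ms for bt in batches]


def f1_score(predicted, actual):
    """F1 = harmonic mean of precision and recall over boolean labels."""
    tp = sum(p and t for p, t in zip(predicted, actual))
    fp = sum(p and not t for p, t in zip(predicted, actual))
    fn = sum(t and not p for p, t in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Assumed healthy calibration data: latency grows linearly with tokens.
a, b = fit_line([100, 200, 300, 400], [20.0, 30.0, 40.0, 50.0])

live = [
    Batch(800, 90.0, False),  # slow, but explained by a large batch
    Batch(200, 75.0, True),   # slow for its size: anomaly
    Batch(400, 52.0, False),  # within tolerance of the baseline
    Batch(100, 60.0, True),   # slow for its size: anomaly
]

flags = flag_anomalies(live, a, b, tolerance_ms=10.0)
score = f1_score(flags, [bt.is_anomaly for bt in live])
```

The first live batch has the highest absolute latency yet is not flagged, because its latency matches what the workload predicts. A fixed latency threshold would misclassify it, which is the distinction the F1-score in the abstract measures.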

