SE · Jan 20 · Code
Why Does the LLM Stop Computing: An Empirical Study of User-Reported Failures in Open-Source LLMs
Guangba Yu, Zirui Wang, Yujie Huang et al.
The democratization of open-source Large Language Models (LLMs) allows users to fine-tune and deploy models on local infrastructure but exposes them to a First Mile deployment landscape. Unlike black-box API consumption, the reliability of user-managed orchestration remains a critical blind spot. To bridge this gap, we conduct the first large-scale empirical study of 705 real-world failures from the open-source DeepSeek, Llama, and Qwen ecosystems. Our analysis reveals a paradigm shift: white-box orchestration relocates the reliability bottleneck from model algorithmic defects to the systemic fragility of the deployment stack. We identify three key phenomena: (1) Diagnostic Divergence: runtime crashes distinctively signal infrastructure friction, whereas incorrect functionality serves as a signature for internal tokenizer defects. (2) Systemic Homogeneity: Root causes converge across divergent series, confirming reliability barriers are inherent to the shared ecosystem rather than specific architectures. (3) Lifecycle Escalation: Barriers escalate from intrinsic configuration struggles during fine-tuning to compounded environmental incompatibilities during inference. Supported by our publicly available dataset, these insights provide actionable guidance for enhancing the reliability of the LLM landscape.
SE · Apr 19
Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs
Renyi Zhong, Yichen Li, Yulun Wu et al.
Logging statements are central to debugging, failure diagnosis, and production observability, yet writing them requires developers to decide where to place a logging statement, which API and severity level to use, and what runtime information to expose. Automated logging aims to reduce this burden, but existing evidence remains dominated by Java-centric repository-snapshot datasets. It is therefore unclear whether conclusions about model behavior and model selection generalize across programming-language ecosystems or to realistic code evolution. This paper presents MultiLogBench, a multilingual benchmark and empirical study spanning six programming-language ecosystems. MultiLogBench contains 63,965 production-code repository-snapshot instances, 744 revision-history cases where developers introduce logging statements during maintenance, and a paired transformed revision-history branch for robustness analysis. Using seven contemporary large language models under a unified protocol, we evaluate logging-site localization, framework-anchor matching, severity prediction, message generation, variable recovery, and cascaded overall quality. Results show clear cross-language variation: framework-anchor matching is the most language-sensitive component, loop and nested-callable sites are the hardest structural contexts, and model rankings are stable only at the top tier. These patterns persist at a coarse level on revision-history data, while transformed inputs do not cause a broad same-direction performance collapse. Overall, MultiLogBench shows that robust claims about automated logging require multilingual evaluation and maintenance-oriented validation.
SE · Mar 11, 2024
Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach
Jinxi Kuang, Jinyang Liu, Junjie Huang et al.
Due to the scale and complexity of cloud systems, a system failure can trigger an "alert storm", i.e., a massive number of correlated alerts. Although these alerts can be traced back to a few root causes, their overwhelming number makes manual handling infeasible. Alert aggregation is thus critical to help engineers concentrate on the root cause and facilitate failure resolution. Existing methods typically aggregate alerts using semantic similarity or statistical techniques. However, semantic similarity-based methods overlook the causal rationale of alerts, while statistical methods can hardly handle infrequent alerts. To tackle these limitations, we propose leveraging external knowledge, i.e., the Standard Operation Procedures (SOPs) of alerts, as a supplement. We propose COLA, a novel hybrid approach based on correlation mining and LLM (Large Language Model) reasoning for online alert aggregation. The correlation mining module captures the temporal and spatial relations between alerts and measures their correlations efficiently. Subsequently, only uncertain pairs with low confidence are forwarded to the LLM reasoning module for detailed analysis. This hybrid design combines statistical evidence for frequent alerts with the reasoning capabilities of computationally intensive LLMs, keeping COLA efficient when handling large volumes of alerts in practical scenarios. We evaluate COLA on three datasets collected from the production environment of a large-scale cloud platform. The experimental results show that COLA achieves F1-scores from 0.901 to 0.930, outperforming state-of-the-art methods while achieving comparable efficiency. We also share our experience in deploying COLA in our real-world cloud system, Cloud X.
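The routing idea in this abstract (statistical evidence decides confident pairs; only uncertain pairs reach the LLM) can be illustrated with a minimal sketch. This is not COLA's implementation: the function name `route_alert_pairs`, the thresholds `low` and `high`, the example scores, and the stub judge standing in for an LLM call are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]


def route_alert_pairs(
    scores: Dict[Pair, float],
    llm_judge: Callable[[str, str], bool],
    low: float = 0.3,   # hypothetical threshold: at or below, treat as unrelated
    high: float = 0.7,  # hypothetical threshold: at or above, treat as related
) -> Tuple[List[Pair], int]:
    """Return the pairs judged related and the number of LLM calls made."""
    related: List[Pair] = []
    llm_calls = 0
    for (a, b), score in scores.items():
        if score >= high:
            # Confident statistical evidence: aggregate without the LLM.
            related.append((a, b))
        elif score <= low:
            # Confidently unrelated: skip without the LLM.
            continue
        else:
            # Uncertain pair: defer to the (expensive) reasoning module.
            llm_calls += 1
            if llm_judge(a, b):
                related.append((a, b))
    return related, llm_calls


def stub_judge(a: str, b: str) -> bool:
    """Stand-in for an LLM call: relates alerts sharing a name prefix."""
    return a.split("_")[0] == b.split("_")[0]


# Made-up correlation scores for illustration only.
scores = {
    ("disk_full", "disk_io_slow"): 0.9,
    ("disk_full", "net_timeout"): 0.1,
    ("net_timeout", "net_packet_loss"): 0.5,
    ("cpu_high", "net_timeout"): 0.5,
}
related, llm_calls = route_alert_pairs(scores, stub_judge)
```

In this toy run only the two mid-confidence pairs reach the judge, which is the efficiency argument the abstract makes: the costly module is invoked on a fraction of the pairs.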