SEIRLGSep 7, 2020

Improving Problem Identification via Automated Log Clustering using Dimensionality Reduction

arXiv:2009.03257v114 citations
Originality Synthesis-oriented
AI Analysis

This work addresses the problem of efficient log analysis for developers or operators in continuous deployment systems, but it is incremental as it replicates and extends earlier methods to a new domain.

The study tackled the problem of automatically grouping logs from failed runs to identify underlying issues, finding that including Non-negative Matrix Factorization (NMF) significantly improves accuracy and robustness, with Complete Linkage performing best among merge criteria.

Goal: We consider the problem of automatically grouping logs of runs that failed for the same underlying reasons, so that they can be treated more effectively, and investigate the following questions: (1) Does an approach developed to identify problems in system logs generalize to identifying problems in continuous deployment logs? (2) How does dimensionality reduction affect the quality of automated log clustering? (3) How does the criterion used for merging clusters in the clustering algorithm affect clustering quality? Method: We replicate and extend earlier work on clustering system log files to assess its generalization to continuous deployment logs. We consider the optional inclusion of one of these dimensionality reduction techniques: Principal Component Analysis (PCA), Latent Semantic Indexing (LSI), and Non-negative Matrix Factorization (NMF). Moreover, we consider three alternative cluster merge criteria (Single Linkage, Average Linkage, and Weighted Linkage), in addition to the Complete Linkage criterion used in earlier work. We empirically evaluate the 16 resulting configurations on continuous deployment logs provided by our industrial collaborator. Results: Our study shows that (1) identifying problems in continuous deployment logs via clustering is feasible, (2) including NMF significantly improves overall accuracy and robustness, and (3) Complete Linkage performs best of all merge criteria analyzed. Conclusions: We conclude that problem identification via automated log clustering is improved by including dimensionality reduction, as it decreases the pipeline's sensitivity to parameter choice, thereby increasing its robustness for handling different inputs.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes