DC | LG | Jul 24, 2019

Live Forensics for Distributed Storage Systems

arXiv:1907.10203v1 | 1 citation
Originality Highly original
AI Analysis

This paper addresses performance forensics for operators of large-scale distributed storage systems, offering a novel method for a known bottleneck.

The paper tackles the problem of diagnosing performance issues in large-scale distributed storage systems by introducing Kaleidoscope, a system that uses temporal and spatial differential observability and stochastic modeling to pinpoint root causes. In the authors' deployment, it pinpoints the root causes of 95.8% of real-world performance issues with negligible monitoring overhead.

We present Kaleidoscope, an innovative system that supports live forensics for application performance problems caused by either individual component failures or resource contention issues in large-scale distributed storage systems. The design of Kaleidoscope is driven by our study of I/O failures observed in a peta-scale storage system anonymized as PetaStore. Kaleidoscope is built on three key features: 1) using temporal and spatial differential observability for end-to-end performance monitoring of I/O requests, 2) modeling the health of storage components as a stochastic process using domain-guided functions that account for path redundancy and uncertainty in measurements, and 3) observing differences in reliability and performance metrics between similar types of healthy and unhealthy components to attribute the most likely root causes. We deployed Kaleidoscope on PetaStore, and our evaluation shows that Kaleidoscope can run live forensics at 5-minute intervals and pinpoint the root causes of 95.8% of real-world performance issues, with negligible monitoring overhead.
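To make the idea of spatial differential observability concrete, here is a minimal Python sketch. It is not the paper's method: Kaleidoscope models component health as a stochastic process, whereas this toy simply scores each component by the fraction of end-to-end probes through it that failed. The function name `rank_suspects`, the component names, and the probe data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def rank_suspects(probes):
    """Rank components by the share of probes through them that failed.

    probes: iterable of (path, ok) pairs, where path is a tuple of
    component names an I/O probe traversed and ok is True on success.
    This is an illustrative heuristic, not Kaleidoscope's health model.
    """
    seen = defaultdict(int)    # probes that traversed each component
    failed = defaultdict(int)  # failed probes that traversed each component
    for path, ok in probes:
        for comp in path:
            seen[comp] += 1
            if not ok:
                failed[comp] += 1
    scores = {comp: failed[comp] / seen[comp] for comp in seen}
    # Highest failure ratio first; ties broken by name for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical probe results over redundant paths (made-up data).
probes = [
    (("client1", "switchA", "oss1"), False),
    (("client1", "switchB", "oss1"), False),
    (("client1", "switchA", "oss2"), True),
    (("client2", "switchB", "oss2"), True),
    (("client2", "switchA", "oss1"), False),
]
ranking = rank_suspects(probes)
for comp, score in ranking:
    print(f"{comp}: {score:.2f}")
```

Because the probes reach `oss1` through different clients and switches, and every probe through it fails while the same clients and switches succeed against `oss2`, the sketch ranks `oss1` as the most likely culprit. Path redundancy is what lets a shared failing component be separated from the healthy components on the same paths.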

