DB, May 28
Zero-Scan Data Quality: Leveraging Table Format Metadata for Continuous Observability at Scale
Mohit Verma, Shantanu Rawat, Christian Bush et al.
Modern table formats such as Apache Iceberg compute and store metadata (commit timestamps, record counts, and column-level statistics such as null counts and value bounds) at write time as part of file writing. These statistics serve query planning, yet they overlap substantially with data quality (DQ) monitoring needs. We describe a metadata-first approach that repurposes write-time statistics for continuous DQ observability (anomaly detection, drift monitoring, and null-rate tracking) without scanning any data. Deployed at LinkedIn across 200,000+ Iceberg tables (800+ PB), this approach satisfies approximately 60% of user-defined DQ rules at zero marginal compute cost and reduces profiling resource consumption by around 50%. Extending manifest statistics with lightweight counters (sum, zero-value counts, boolean counts) and incrementally mergeable sketches (Theta sketches for distinct counts, KLL sketches for quantiles) can further raise metadata-satisfiable coverage to close to 90% of production DQ rules. We validate sketch accuracy, mergeability, and storage overhead on production data and propose that table formats should store per-file sketches in Puffin sidecar files, following the same store-then-aggregate pattern used for existing manifest statistics.
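The store-then-aggregate pattern described above can be illustrated with a minimal Python sketch. It assumes per-file statistics for a single column have already been read from table metadata; the `FileStats` record, the function names, and the 10% threshold are hypothetical and are not part of Iceberg's API or of the authors' system.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for one column."""
    record_count: int
    null_count: int
    lower_bound: Optional[float]  # None when the file holds only nulls
    upper_bound: Optional[float]


def merge_stats(files: Iterable[FileStats]) -> FileStats:
    """Combine per-file statistics into table-level statistics.

    Counts add up and bounds combine with min/max, so no data file
    needs to be opened.
    """
    total = nulls = 0
    lows, highs = [], []
    for f in files:
        total += f.record_count
        nulls += f.null_count
        if f.lower_bound is not None:
            lows.append(f.lower_bound)
        if f.upper_bound is not None:
            highs.append(f.upper_bound)
    return FileStats(
        record_count=total,
        null_count=nulls,
        lower_bound=min(lows) if lows else None,
        upper_bound=max(highs) if highs else None,
    )


def null_rate(stats: FileStats) -> float:
    """Fraction of null values; 0.0 for an empty table."""
    if stats.record_count == 0:
        return 0.0
    return stats.null_count / stats.record_count


def passes_null_rule(stats: FileStats, max_rate: float) -> bool:
    """Evaluate an illustrative DQ rule: null rate must not exceed max_rate."""
    return null_rate(stats) <= max_rate


files = [FileStats(100, 5, 1.0, 9.0), FileStats(300, 15, -2.0, 4.0)]
table = merge_stats(files)
rule_ok = passes_null_rule(table, max_rate=0.10)
```

Because sums, minima, and maxima are associative, newly committed files could be folded into a previous aggregate incrementally instead of recomputing over all files.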
DC, Feb 5, 2024
Dependency Aware Incident Linking in Large Cloud Systems
Supriyo Ghosh, Karish Grover, Jimmy Wong et al.
Despite significant reliability efforts, large-scale cloud services inevitably experience production incidents that can significantly impact service availability and customer satisfaction. Worse, in many cases one incident can lead to multiple downstream failures due to cascading effects that create several related incidents across different dependent services. Oftentimes, on-call engineers (OCEs) examine these incidents in silos, which leads to a significant amount of manual toil and increases the overall time to mitigate incidents. Therefore, developing efficient incident linking models is of paramount importance for grouping related incidents into clusters so as to quickly resolve major outages and reduce on-call fatigue. Existing incident linking methods mostly leverage textual and contextual information of incidents (e.g., title, description, severity, impacted components), thus failing to leverage the inter-dependencies between services. In this paper, we propose the dependency-aware incident linking (DiLink) framework, which leverages both textual and service dependency graph information to improve the accuracy and coverage of incident links, not only within the same service but also across different services and workloads. Furthermore, we propose a novel method to align the embeddings of multi-modal (i.e., textual and graphical) data using Orthogonal Procrustes. Extensive experimental results on real-world incidents from 5 workloads of Microsoft demonstrate that our alignment method achieves an F1-score of 0.96 (a 14% gain over current state-of-the-art methods). We are also in the process of deploying this solution across 610 services from these 5 workloads to continuously support OCEs, improving incident management and reducing manual toil.
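As a rough illustration of embedding alignment with Orthogonal Procrustes, the sketch below solves a deliberately restricted case: two-dimensional points and rotations only (no reflections), where the optimal angle has a closed form. The general problem is solved with a singular value decomposition in higher dimensions. The function names and toy points are hypothetical, and this is not the DiLink implementation.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def fit_rotation(source: List[Point], target: List[Point]) -> float:
    """Return the angle that best rotates `source` onto `target`.

    Minimizes the sum of squared distances between rotated source points
    and their paired target points (2-D, rotation-only Procrustes).
    """
    # Sum of cross products and dot products over paired points.
    sin_sum = sum(ax * by - ay * bx for (ax, ay), (bx, by) in zip(source, target))
    cos_sum = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(source, target))
    return math.atan2(sin_sum, cos_sum)


def rotate(points: List[Point], theta: float) -> List[Point]:
    """Rotate every point counter-clockwise by `theta` radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [(x * c - y * s, x * s + y * c) for x, y in points]


# Toy "embeddings": the target space is the source space rotated by 0.7 rad.
source = [(1.0, 0.0), (0.0, 1.0), (2.0, 1.0)]
target = rotate(source, 0.7)

theta = fit_rotation(source, target)
aligned = rotate(source, theta)
```

After alignment, paired points from the two spaces coincide, so distances between them become directly comparable.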
AI, Dec 17, 2015
A thermodynamical approach towards multi-criteria decision making (MCDM)
Mohit Verma, J. Rajasankar
In multi-criteria decision making (MCDM) problems, ratings are assigned to the alternatives on different criteria by the expert group. In this paper, we propose a thermodynamically consistent model for MCDM using analogies to the thermodynamical indicators energy, exergy, and entropy. The most commonly used method for analysing MCDM problems is the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS). The conventional TOPSIS method uses a measure similar to energy for ranking the alternatives. We demonstrate that the ranking of the alternatives is more meaningful if exergy is used in place of energy. The use of exergy is superior because it includes a factor accounting for the quality of the ratings given by the expert group. The unevenness in the experts' ratings is measured by entropy. The procedure for calculating the thermodynamical indicators is explained in both crisp and fuzzy environments. Finally, two case studies are carried out to demonstrate the effectiveness of the proposed model.
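For readers unfamiliar with the baseline, conventional crisp TOPSIS can be sketched in Python as follows. The sketch covers only the standard method, not the exergy-based model proposed in the paper, and it assumes every criterion is a benefit criterion (higher is better); the ratings and weights are made up for illustration.

```python
import math
from typing import List


def topsis(ratings: List[List[float]], weights: List[float]) -> List[float]:
    """Return the TOPSIS closeness score of each alternative (higher is better).

    ratings[i][j] is the rating of alternative i on criterion j.
    Assumes all criteria are benefit criteria, no criterion column is all
    zeros, and the alternatives are not all identical.
    """
    n = len(weights)
    # Vector-normalize each criterion column, then apply its weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in ratings)) for j in range(n)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in ratings]
    # Ideal and anti-ideal solutions, taken per criterion.
    best = [max(row[j] for row in weighted) for j in range(n)]
    worst = [min(row[j] for row in weighted) for j in range(n)]
    scores = []
    for row in weighted:
        d_best = math.dist(row, best)    # distance to the ideal solution
        d_worst = math.dist(row, worst)  # distance to the anti-ideal solution
        scores.append(d_worst / (d_best + d_worst))
    return scores


# Three alternatives rated on two equally weighted criteria.
scores = topsis([[9.0, 9.0], [1.0, 1.0], [5.0, 5.0]], [0.5, 0.5])
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Here the first alternative coincides with the ideal solution and the second with the anti-ideal solution, so the third lands midway between them.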