SE · SY · Oct 12, 2020

CADET: Debugging and Fixing Misconfigurations using Counterfactual Reasoning

arXiv:2010.06061v2 · 4 citations
Originality: Incremental advance
AI Analysis

This work addresses configuration errors in modern computing platforms and offers a principled solution for developers and system administrators. It is an incremental advance because it builds on existing causal reasoning techniques.

The paper tackles the problem of debugging and fixing misconfigurations in highly configurable computing systems. It proposes CADET, a toolkit that uses causal modeling and counterfactual reasoning to identify root causes and prescribe repairs. The authors report up to 17% higher accuracy, 28% higher gain, and a 40x speed-up compared to other ML-based performance debugging methods.

Modern computing platforms are highly-configurable with thousands of interacting configurations. However, configuring these systems is challenging. Erroneous configurations can cause unexpected non-functional faults. This paper proposes CADET (short for Causal Debugging Toolkit) that enables users to identify, explain, and fix the root cause of non-functional faults early and in a principled fashion. CADET builds a causal model by observing the performance of the system under different configurations. Then, it uses causal path extraction followed by counterfactual reasoning over the causal model to: (a) identify the root causes of non-functional faults, (b) estimate the effects of various configurable parameters on the performance objective(s), and (c) prescribe candidate repairs to the relevant configuration options to fix the non-functional fault. We evaluated CADET on 5 highly-configurable systems deployed on 3 NVIDIA Jetson systems-on-chip. We compare CADET with state-of-the-art configuration optimization and ML-based debugging approaches. The experimental results indicate that CADET can find effective repairs for faults in multiple non-functional properties with (at most) 17% more accuracy, 28% higher gain, and $40\times$ speed-up than other ML-based performance debugging methods. Compared to multi-objective optimization approaches, CADET can find fixes (at most) $9\times$ faster with comparable or better performance gain. Our case study of non-functional faults reported in NVIDIA's forum shows that CADET can find $14\%$ better repairs than the experts' advice in less than 30 minutes.
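
To make the counterfactual step in the abstract concrete, here is a minimal sketch of how a repair could be ranked once a causal model is available. It is not CADET's implementation. The structural equations, the option names (`cache_policy`, `cpu_freq`), the intermediate variable (`cache_misses`), and all coefficients are hypothetical and hand-written for illustration, whereas CADET learns its causal model from observed performance data. The sketch follows the standard abduction, action, prediction recipe for counterfactuals: recover the noise terms from the faulty observation, apply a candidate configuration, and predict the latency that would have been observed.

```python
from itertools import product

# Hypothetical toy structural causal model (an assumption, not from the paper):
#   cache_policy -> cache_misses -> latency <- cpu_freq
DOMAINS = {"cache_policy": [0, 1], "cpu_freq": [1, 2]}


def cache_misses(cfg, u_cache):
    """Structural equation for the intermediate system event."""
    return 100 - 40 * cfg["cache_policy"] + u_cache


def latency(cfg, misses, u_lat):
    """Structural equation for the performance objective."""
    return 0.5 * misses + 30 / cfg["cpu_freq"] + u_lat


def abduct(cfg, observed_misses, observed_latency):
    """Abduction: recover the noise terms consistent with the faulty run."""
    u_cache = observed_misses - cache_misses(cfg, 0)
    u_lat = observed_latency - latency(cfg, observed_misses, 0)
    return u_cache, u_lat


def counterfactual_latency(cfg, noise):
    """Action + prediction: replay the same noise under a different config."""
    u_cache, u_lat = noise
    return latency(cfg, cache_misses(cfg, u_cache), u_lat)


def rank_repairs(faulty_cfg, observed_misses, observed_latency):
    """Return (gain, candidate_config) pairs, largest latency reduction first."""
    noise = abduct(faulty_cfg, observed_misses, observed_latency)
    repairs = []
    for values in product(*DOMAINS.values()):
        candidate = dict(zip(DOMAINS, values))
        if candidate == faulty_cfg:
            continue  # not a repair: identical to the faulty configuration
        gain = observed_latency - counterfactual_latency(candidate, noise)
        repairs.append((gain, candidate))
    repairs.sort(key=lambda r: r[0], reverse=True)
    return repairs


# A faulty run: hypothetical measurements under the default configuration.
faulty = {"cache_policy": 0, "cpu_freq": 1}
ranked = rank_repairs(faulty, observed_misses=110, observed_latency=90)
best_gain, best_cfg = ranked[0]
print(best_gain, best_cfg)
```

In this toy setting the exhaustive search over `DOMAINS` is cheap. With thousands of interacting options that is no longer feasible, which is presumably why the paper restricts attention to extracted causal paths before reasoning about repairs.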

Code Implementations: 1 repo
