Diffusion-based Time Series Data Imputation for Microsoft 365
This work addresses data quality issues in cloud reliability for large-scale systems like Microsoft 365, though it appears incremental as it applies a diffusion-based method to a specific domain.
The paper tackles poor data quality caused by missing values in cloud failure prediction for Microsoft 365. It proposes Diffusion+, a sample-efficient diffusion model for data imputation, and reports, from experiments and application practice, improved performance on downstream failure prediction tasks.
Reliability is critical for large-scale cloud systems like Microsoft 365. Cloud failures such as disk and node failures threaten service reliability, causing online service interruptions and economic loss. Existing work focuses on predicting cloud failures so that action can be taken proactively before they happen. However, these approaches suffer from poor data quality, such as missing values in the data used for model training and prediction, which limits their performance. In this paper, we focus on improving data quality through data imputation with the proposed Diffusion+, a sample-efficient diffusion model that imputes missing data efficiently based on the observed data. Our experiments and application practice show that our model helps improve the performance of the downstream failure prediction task.
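To make the imputation task concrete, the sketch below shows what it means to fill missing values in a time series based on the observed ones. It is not Diffusion+: the abstract gives no architectural details, so plain linear interpolation is used here as a stand-in imputer. The function name impute_linear and the sample readings are hypothetical, chosen only for illustration. A learned model such as a diffusion model would take the place of the interpolation step while keeping the same contract: observed values are left unchanged, and only missing entries are filled.

```python
def impute_linear(series):
    """Fill None entries by linear interpolation between the nearest observed points.

    Stand-in imputer for illustration only (not the paper's diffusion model).
    Gaps at the start or end are filled with the nearest observed value.
    """
    observed = [i for i, v in enumerate(series) if v is not None]
    if not observed:
        raise ValueError("series has no observed values")

    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue  # observed values are never modified
        left = max((j for j in observed if j < i), default=None)
        right = min((j for j in observed if j > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            span = series[right] - series[left]
            filled[i] = series[left] + span * (i - left) / (right - left)
    return filled


# Hypothetical monitoring signal with three missing samples.
readings = [10.0, None, None, 16.0, None]
print(impute_linear(readings))  # [10.0, 12.0, 14.0, 16.0, 16.0]
```

The imputed series can then be passed to a downstream failure predictor in place of the incomplete one, which is the role the abstract assigns to the imputation step.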