SE | LG | Aug 15, 2020

Site Reliability Engineering: Application of Item Response Theory to Application Deployment Practices and Controls

arXiv:2008.06717v1 | 1 citation
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of optimizing deployment practices for SRE teams, though it appears incremental, applying existing statistical methods to a new domain.

The paper tackles the challenge of balancing application reliability with deployment velocity in Site Reliability Engineering by proposing an Application Deployment Score based on Item Response Theory, which helps assess deployment improvements, adjust error budgets, and evaluate the effectiveness of deployment controls.

Reliability of an application or solution in a production environment is one of the fundamental features on which every SRE team is critically focused. At the same time, achieving extreme reliability comes at a cost, which includes but is not limited to a slow pace of new feature deployments, operations cost, and opportunity cost. One earlier effort to provide an objective metric that strikes a fine balance between acceptable reliability and product velocity is the error budget and its associated policy. Each organization also has contemporary deployment guidelines and controls to ascertain the reliability of an application version deployed into customer-facing or production environments. This work proposes a new objective metric called the Application Deployment Score, estimated using a dichotomous Item Response Theory model. This score is used to assess the improvement trend of each application version deployed into the customer-facing environment; to identify the scope for improvement of each application deployment in each area of the deployment guidelines and controls; and to adjust the error budget (i.e., the soft error budget) of an interdependent application in an application mesh by assigning soft collective responsibility. Finally, the work defines a new metric called the deployment index, which helps assess the effectiveness of these contemporary deployment guidelines and controls in upholding the agreed SLOs of the application in customer-facing environments. This study opens a new field of research in developing new underlying latent indexes (i.e., new objective metrics) in the SRE and DevOps space.
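To make the idea of a score "estimated using a dichotomous Item Response Theory model" concrete, here is a minimal sketch. The abstract does not say which dichotomous model the paper uses, so the sketch assumes the one-parameter logistic (Rasch) model. Each deployment control is treated as an item that a deployment either passes (1) or fails (0), and the score is the latent trait theta. The function names, the control difficulties, and the pass/fail data are all hypothetical and are not taken from the paper.

```python
import math


def rasch_probability(theta, difficulty):
    """Probability that a control is passed, given latent score theta.

    Assumes the one-parameter logistic (Rasch) model:
    P(pass) = 1 / (1 + exp(-(theta - difficulty))).
    """
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def estimate_score(responses, difficulties, iterations=100, tol=1e-10):
    """Maximum-likelihood estimate of theta by Newton-Raphson.

    responses    -- list of 0/1 outcomes, one per deployment control
    difficulties -- item difficulties, assumed already calibrated (hypothetical)
    """
    if len(responses) != len(difficulties):
        raise ValueError("responses and difficulties must have equal length")
    passed = sum(responses)
    if passed == 0 or passed == len(responses):
        # The likelihood has no finite maximum in these two cases.
        raise ValueError("MLE is unbounded when all controls pass or all fail")

    theta = 0.0
    for _ in range(iterations):
        probs = [rasch_probability(theta, b) for b in difficulties]
        # First derivative of the log-likelihood with respect to theta.
        gradient = sum(x - p for x, p in zip(responses, probs))
        # Negative second derivative (the Fisher information).
        information = sum(p * (1.0 - p) for p in probs)
        # Clamp the step to keep the iteration stable.
        step = max(-1.0, min(1.0, gradient / information))
        theta += step
        if abs(step) < tol:
            break
    return theta


if __name__ == "__main__":
    # Hypothetical difficulties for three deployment controls.
    difficulties = [-1.0, 0.0, 1.0]
    print(estimate_score([1, 1, 0], difficulties))
    print(estimate_score([1, 0, 0], difficulties))
```

Under the Rasch assumption the estimate depends only on how many controls were passed, not on which ones. A two-parameter model with per-control discrimination would weight the controls differently.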

