CYApr 17

Evidence Sufficiency Under Delayed Ground Truth: Proxy Monitoring for Risk Decision Systems

arXiv:2604.15740 55.41 citations

AI Analysis

For governance of risk decision systems (e.g., fraud detection, credit scoring), this paper provides a formal framework for quantifying how label latency degrades evidence, but mapping its sufficiency levels to specific governance actions requires further calibration.

This paper formalizes an evidence sufficiency model for machine learning systems under delayed ground truth, with four dimensions and a decision-readiness gate. With controlled drift injection on the IEEE-CIS Fraud Detection dataset, proxy monitoring detects covariate and mixed drift with a 100% detection rate, while concept drift without feature change remains undetected, as theoretically expected.

Machine learning systems in fraud detection, credit scoring, and clinical risk assessment operate under delayed ground truth: outcome labels arrive days to months after the decision they evaluate. During this blind period, governance evidence degrades through mechanisms that neither drift detection methods nor governance frameworks adequately address. This paper formalizes an evidence sufficiency model with four dimensions (completeness, freshness, reliability, representativeness) and a decision-readiness gate that quantifies how label latency degrades evidence quality. The model maps three drift types to dimension-specific degradation trajectories. A complementary proxy indicator framework comprising seven measurement categories estimates sufficiency degradation without labels, with explicit coverage mapping and characterized blind spots per drift type. Evaluation on the IEEE-CIS Fraud Detection dataset (~590K transactions) with controlled drift injection shows that composite proxy monitoring detects covariate and mixed drift with 100% detection rate, while concept drift without feature change remains undetected -- consistent with the theoretical impossibility of unsupervised detection when P(X) is unchanged. Blind period simulation confirms monotone sufficiency degradation, with concept drift degrading fastest (S=0.242 at day 60 vs 0.418 for no-drift). The framework contributes a governance sufficiency monitoring instrument; its value lies in translating drift signals into auditable sufficiency assessments with characterized blind spots. Mapping sufficiency levels to governance actions requires deployment-specific calibration beyond this study's scope.
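The blind spot described in the abstract can be illustrated with a minimal sketch. It is not the paper's proxy framework: it uses a single Population Stability Index (PSI) monitor on one synthetic Gaussian feature as a stand-in for label-free monitoring. The `psi` helper, the threshold-based labeling rules, and the 0.1 alert threshold (a common rule of thumb) are all assumptions made for illustration. The point it shows is that a monitor that only sees features flags a shift in P(X) but stays silent when only P(Y|X) changes.

```python
import numpy as np


def psi(reference, current, bins=10):
    """Population Stability Index between two 1-D samples (illustrative helper)."""
    # Interior bin edges are reference quantiles, so each reference bin
    # holds roughly equal mass.
    interior = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1)[1:-1])
    ref_counts = np.bincount(np.searchsorted(interior, reference), minlength=bins)
    cur_counts = np.bincount(np.searchsorted(interior, current), minlength=bins)
    # Clip to avoid log(0) when a bin is empty.
    ref_frac = np.clip(ref_counts / len(reference), 1e-6, None)
    cur_frac = np.clip(cur_counts / len(current), 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


def recall(pred, truth):
    """Fraction of true positives that the prediction catches."""
    return float((pred & truth).sum() / truth.sum())


rng = np.random.default_rng(0)
n = 20_000
ALERT = 0.1  # assumed rule-of-thumb PSI alert threshold

# Reference period: the deployed rule "x > 1.5" matches the true labels.
x_ref = rng.normal(0.0, 1.0, n)
y_ref = x_ref > 1.5

# Covariate drift: P(X) shifts, the labeling rule is unchanged.
x_cov = rng.normal(0.5, 1.0, n)

# Concept drift: P(X) is unchanged, but the true rule becomes "x < -1.5".
# These labels would not be observable during the blind period.
x_con = rng.normal(0.0, 1.0, n)
y_con = x_con < -1.5

psi_cov = psi(x_ref, x_cov)
psi_con = psi(x_ref, x_con)
recall_ref = recall(x_ref > 1.5, y_ref)
recall_con = recall(x_con > 1.5, y_con)

print(f"covariate drift: PSI={psi_cov:.3f} alert={psi_cov > ALERT}")
print(f"concept drift:   PSI={psi_con:.3f} alert={psi_con > ALERT}")
print(f"recall of deployed rule: reference={recall_ref:.2f}, concept drift={recall_con:.2f}")
```

In this toy setup the deployed rule loses all recall under concept drift while the feature-only monitor raises no alert, which is why such cases call for delayed labels or other evidence rather than a more sensitive unsupervised detector.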
