LG · Apr 15

Calibrate-Then-Delegate: Safety Monitoring with Risk and Budget Guarantees via Model Cascades

arXiv:2604.14251 · 28.9 · h-index: 7
Predicted impact: top 16% in LG · last 90 days · Originality: Incremental advance
AI Analysis

For LLM safety monitoring, CTD provides a principled method to balance cost and accuracy with finite-sample guarantees on delegation rate, addressing a key bottleneck in scalable safety systems.

CTD introduces a delegation value probe that predicts whether an expert will correct a safety error, enabling model cascades with probabilistic cost guarantees. It outperforms uncertainty-based delegation across four safety datasets, avoiding harmful over-delegation while adapting to input difficulty.

Monitoring LLM safety at scale requires balancing cost and accuracy: a cheap latent-space probe can screen every input, but hard cases should be escalated to a more expensive expert. Existing cascades delegate based on probe uncertainty, but uncertainty is a poor proxy for delegation benefit, as it ignores whether the expert would actually correct the error. To address this problem, we introduce Calibrate-Then-Delegate (CTD), a model-cascade approach that provides probabilistic guarantees on the computation cost while enabling instance-level (streaming) decisions. CTD builds on a novel delegation value (DV) probe, a lightweight model operating on the same internal representations as the safety probe that directly predicts the benefit of escalation. To enforce budget constraints, CTD calibrates a threshold on the DV signal using held-out data via multiple hypothesis testing, yielding finite-sample guarantees on the delegation rate. Evaluated on four safety datasets, CTD consistently outperforms uncertainty-based delegation at every budget level, avoids harmful over-delegation, and adapts budget allocation to input difficulty without requiring group labels.
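The calibration step described above can be illustrated with a minimal sketch. The abstract does not specify the exact testing procedure, so this example assumes one common choice: fixed-sequence testing over a grid of candidate thresholds with exact binomial-tail p-values. The names `binom_cdf`, `calibrate_threshold`, and `delegate` are hypothetical and are not taken from the paper.

```python
import math


def binom_cdf(k, n, p):
    """Exact P(Bin(n, p) <= k) for 0 < p < 1, summed in log space for stability."""
    total = 0.0
    for i in range(k + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log(1.0 - p)
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


def calibrate_threshold(dv_scores, budget, delta, candidates):
    """Pick a delegation threshold from held-out DV scores (assumed procedure).

    For each candidate t, the null hypothesis is "the true delegation rate
    P(score > t) exceeds `budget`". Candidates are tested from the highest
    threshold (fewest delegations) downward, and testing stops at the first
    non-rejection. Because the delegation rate is monotone in t, this
    fixed-sequence scheme keeps the chance of certifying any over-budget
    threshold at or below `delta`.

    Returns the lowest certified threshold, or None if none is certified.
    """
    n = len(dv_scores)
    chosen = None
    for t in sorted(candidates, reverse=True):
        k = sum(s > t for s in dv_scores)   # delegations on held-out data
        p_value = binom_cdf(k, n, budget)   # small when k is far below n * budget
        if p_value > delta:
            break                           # cannot certify t; stop testing
        chosen = t
    return chosen


def delegate(dv_score, threshold):
    """Streaming decision: escalate to the expert only above the threshold."""
    return threshold is not None and dv_score > threshold
```

Under these assumptions, `calibrate_threshold` runs once on held-out data, and `delegate` is then applied to each incoming input independently, which matches the instance-level setting in the abstract. A return value of `None` means the calibration set was too small to certify any candidate at the requested budget, in which case this sketch delegates nothing.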

