Calibrate-Then-Delegate: Safety Monitoring with Risk and Budget Guarantees via Model Cascades
For LLM safety monitoring, Calibrate-Then-Delegate (CTD) balances cost and accuracy with finite-sample guarantees on the delegation rate, addressing a key bottleneck in scalable safety systems.
CTD introduces a delegation value probe that predicts whether an expert will correct a safety error, enabling model cascades with probabilistic cost guarantees. It outperforms uncertainty-based delegation across four safety datasets, avoiding harmful over-delegation while adapting to input difficulty.
Monitoring LLM safety at scale requires balancing cost and accuracy: a cheap latent-space probe can screen every input, but hard cases should be escalated to a more expensive expert. Existing cascades delegate based on probe uncertainty, but uncertainty is a poor proxy for delegation benefit, as it ignores whether the expert would actually correct the error. To address this problem, we introduce Calibrate-Then-Delegate (CTD), a model-cascade approach that provides probabilistic guarantees on the computational cost while enabling instance-level (streaming) decisions. CTD builds on a novel delegation value (DV) probe, a lightweight model that operates on the same internal representations as the safety probe and directly predicts the benefit of escalation. To enforce budget constraints, CTD calibrates a threshold on the DV signal using held-out data via multiple hypothesis testing, yielding finite-sample guarantees on the delegation rate. Evaluated on four safety datasets, CTD consistently outperforms uncertainty-based delegation at every budget level, avoids harmful over-delegation, and adapts budget allocation to input difficulty without requiring group labels.
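To make the calibration step concrete, the following is a minimal sketch of how a threshold on the DV signal could be chosen from held-out scores so that the delegation rate stays within a budget with high probability. It is not taken from the paper. It assumes a fixed-sequence testing procedure with Hoeffding p-values as one possible instance of multiple hypothesis testing, and the names `hoeffding_p_value`, `calibrate_threshold`, `delegate`, `budget`, and `delta` are hypothetical. The held-out DV scores are simulated with uniform random numbers purely for illustration.

```python
import math
import random


def hoeffding_p_value(empirical_rate, budget, n):
    """p-value for H0: true delegation rate > budget (Hoeffding bound).

    Assumption: this particular p-value is chosen for the sketch only.
    """
    if empirical_rate >= budget:
        return 1.0
    return math.exp(-2.0 * n * (budget - empirical_rate) ** 2)


def calibrate_threshold(dv_scores, budget, delta, candidates):
    """Pick the lowest threshold certified to respect the budget.

    Fixed-sequence testing: candidates are tested from the most
    conservative (highest threshold, fewest delegations) downward,
    each at level delta, stopping at the first non-rejection.
    Returns None if no candidate can be certified.
    """
    n = len(dv_scores)
    chosen = None
    for lam in sorted(candidates, reverse=True):
        rate = sum(s > lam for s in dv_scores) / n
        if hoeffding_p_value(rate, budget, n) > delta:
            break  # cannot reject H0 here, so stop the sequence
        chosen = lam
    return chosen


def delegate(dv_score, threshold):
    """Streaming decision: escalate one input to the expert or not."""
    return threshold is not None and dv_score > threshold


# Simulated held-out DV scores (illustrative stand-in for probe outputs).
random.seed(0)
calibration_scores = [random.random() for _ in range(2000)]
budget, delta = 0.2, 0.05  # delegate at most 20%, with 95% confidence
candidates = [i / 100 for i in range(101)]

threshold = calibrate_threshold(calibration_scores, budget, delta, candidates)
empirical_rate = sum(
    delegate(s, threshold) for s in calibration_scores
) / len(calibration_scores)
```

Under these assumptions, the calibrated threshold is fixed once and then applied to each incoming input independently, which is what permits instance-level decisions. The empirical delegation rate on the calibration set sits strictly below the budget, and the gap is the price of the finite-sample guarantee.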