LG | AP | ML | Jun 16, 2025

Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs

arXiv:2506.13593v4 | 2 citations | h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses safety risk assessment for generative AI models, offering a prompt-adaptive evaluation method, though it is incremental in applying existing statistical techniques to a new domain-specific problem.

The paper tackles the problem of estimating time-to-unsafe-sampling in large language models, a safety measure suited to rare unsafe outputs. It proposes a novel calibration technique based on conformal prediction that yields lower predictive bounds with rigorous coverage guarantees, and its experiments show improved sample efficiency.

We introduce time-to-unsafe-sampling, a novel safety measure for generative models, defined as the number of generations required by a large language model (LLM) to trigger an unsafe (e.g., toxic) response. While providing a new dimension for prompt-adaptive safety evaluation, quantifying time-to-unsafe-sampling is challenging: unsafe outputs are often rare in well-aligned models and thus may not be observed under any feasible sampling budget. To address this challenge, we frame this estimation problem as one of survival analysis. We build on recent developments in conformal prediction and propose a novel calibration technique to construct a lower predictive bound (LPB) on the time-to-unsafe-sampling of a given prompt with rigorous coverage guarantees. Our key technical innovation is an optimized sampling-budget allocation scheme that improves sample efficiency while maintaining distribution-free guarantees. Experiments on both synthetic and real data support our theoretical results and demonstrate the practical utility of our method for safety risk assessment in generative AI models.
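To make the definition and the censoring problem concrete, here is a minimal Python sketch. It is not the paper's method: it only shows why a finite sampling budget turns the measurement into a survival-analysis problem. The helper names (`time_to_unsafe`, `bernoulli_draw`) and the assumption that each generation is independently unsafe with a fixed probability `p` are illustrative and do not come from the source.

```python
import random
from typing import Callable, Tuple


def time_to_unsafe(is_unsafe_draw: Callable[[], bool], budget: int) -> Tuple[int, bool]:
    """Sample up to `budget` generations for one prompt.

    Returns (time, observed):
      time     -- 1-based index of the first unsafe generation,
                  or `budget` if no unsafe generation was seen
      observed -- True if an unsafe generation occurred within the budget,
                  False if the measurement is right-censored at `budget`
    """
    for t in range(1, budget + 1):
        if is_unsafe_draw():
            return t, True
    return budget, False


def bernoulli_draw(p: float, rng: random.Random) -> Callable[[], bool]:
    """Hypothetical stand-in for 'generate once and run a safety classifier'.

    Assumes each generation is independently unsafe with probability p.
    """
    return lambda: rng.random() < p


# Simulate 200 prompts from a well-aligned model (assumed p = 0.001)
# under a budget of 100 generations per prompt.
rng = random.Random(0)
results = [time_to_unsafe(bernoulli_draw(0.001, rng), budget=100) for _ in range(200)]
censored_fraction = sum(1 for _, observed in results if not observed) / len(results)
print(f"fraction of prompts censored at the budget: {censored_fraction:.2f}")
```

Under these assumed numbers, the probability that a prompt shows no unsafe output within the budget is (1 - 0.001)^100, roughly 0.905, so most measurements only reveal that the true time exceeds the budget. This is the right-censoring that motivates the survival-analysis framing and the calibrated lower bound.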

