Identified-Set Geometry of Distributional Model Extraction under Top-$K$ Censored API Access
For practitioners and researchers concerned with model extraction via API access, this work gives theoretical bounds and empirical evidence that top-$K$ logit censoring limits per-position distribution recovery but is not sufficient to prevent capability extraction, which separates distributional fidelity from capability transfer.
The paper studies the limits of recovering per-position token distributions from LLM APIs that reveal only top-$K$ logit scores. It derives the exact total-variation diameter of the identified set of compatible distributions, together with lower and upper bounds for KL recovery, and shows experimentally that top-$K$ censoring limits distribution recovery without preventing capability extraction: generation-based extraction recovers 96% of private capability, versus 12% for top-$K$ distillation.
Modern LLM APIs often reveal only top-$K$ logit scores and censor the remaining vocabulary. We study the per-position distribution-recovery limits of this access model. For censoring threshold $\tau$, the compatible teacher distributions form an identified set whose total-variation diameter is exactly $U_K=(V-K)\exp(\tau)/(Z_A+(V-K)\exp(\tau))$, where $V$ is the vocabulary size and $Z_A$ is the observed partition function. For KL recovery, we give a computable binary-endpoint lower bound and an asymptotically matching small-ambiguity upper bound, with an extension to reference-aware attackers. Experiments on a Qwen3 math-reasoning teacher reveal a layered extraction hierarchy: on-task top-$K$ distillation recovers 12% of private capability, full-logit distillation recovers 56% despite 99% KL closure, and generation-based extraction recovers 96%. Top-$K$ censoring therefore limits per-position distribution recovery but does not by itself prevent capability extraction, separating fidelity from transfer in prompt-only logit distillation.
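As an illustration of how the diameter $U_K$ could be evaluated from a single API response, the following minimal Python sketch computes the formula above. It rests on assumptions not stated in the abstract: $Z_A$ is taken to be the sum of exponentiated revealed logits, $V$ is taken to be the vocabulary size, and the function name `tv_diameter` and its parameters are hypothetical. It is a sketch of the closed-form expression only, not the authors' implementation.

```python
import math


def tv_diameter(top_logits, tau, vocab_size):
    """Evaluate U_K = (V-K) exp(tau) / (Z_A + (V-K) exp(tau)).

    Assumptions (not confirmed by the source):
      - top_logits holds the K revealed logit scores,
      - Z_A is the sum of exp(logit) over those K scores,
      - tau is the censoring threshold bounding every hidden logit,
      - vocab_size is V.
    """
    k = len(top_logits)
    if k == 0:
        raise ValueError("at least one revealed logit is required")
    if k > vocab_size:
        raise ValueError("K cannot exceed the vocabulary size")

    # Subtract a common shift before exponentiating; the ratio is
    # unchanged, but large logits no longer overflow.
    shift = max(max(top_logits), tau)
    z_a = sum(math.exp(score - shift) for score in top_logits)
    censored_mass = (vocab_size - k) * math.exp(tau - shift)
    return censored_mass / (z_a + censored_mass)


if __name__ == "__main__":
    # Two revealed logits at 0, threshold 0, four-token vocabulary:
    # Z_A = 2 and (V-K) exp(tau) = 2, so U_K = 0.5.
    print(tv_diameter([0.0, 0.0], tau=0.0, vocab_size=4))
```

Under these assumptions, $U_K$ is the largest probability mass the censored tokens could jointly hold, so it shrinks as the threshold $\tau$ falls relative to the revealed logits and vanishes when $K=V$.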