The Security Budget of Code LLMs: An Information-Theoretic Capacity-Security Bound
For developers and security researchers using code LLMs, this provides a theoretical framework to understand the trade-off between model capability and vulnerability to prompt perturbations.
The paper derives an information-theoretic bound between functional capacity and perturbation retention for code LLMs, proving that their sum is bounded by task entropy and prompt leakage. Empirical validation across models and datasets shows the bound holds with saturation 0.27–0.92 and slack 2.36–26.94 nats.
AI programming assistants make natural-language prompts a software-development interface, so small prompt perturbations become usability and security risks. We study an information-theoretic trade-off for code LLMs between functional capacity, $\Cap=\rmI(c^*;c_π)$, and perturbation retention, $\Sec=\rmI(c_π;\tilde c_π)$. Here $\Sec$ is a retention-channel quantity, not a direct measure of exploit success or vulnerable-code generation. For code completion modeled as $p\to c_π$ with perturbed prompt $\tilde p$, we prove $\Cap+\Sec\le \rmH(c^*)+\rmI(p;\tilde p)$, decomposing the budget into task entropy and prompt leakage. A deterministic-embedding corollary gives the hidden-state version, and a tokenizer/gzip companion bound gives a model-agnostic ceiling on sequence-level task entropy. Empirically, we estimate embedded $\Cap$ and $\Sec$ from output-only last-token hidden states, excluding prompt context from the $\Sec$ channel. Six individual validation rows across two models, two datasets, INT4/BF16 precision, and estimator ablations satisfy the embedded check $(\Cap+\max_T\Sec)/(\rmH(z^*)+\max_T\rmI(p;\tilde p))\le1$. Saturation is 0.27--0.92 and theorem slack is 2.36--26.94 nats; a separate three-seed stability diagnostic has mean saturation 0.87. A context-mixed cosine, used only as a per-problem generation-prompt alignment signal, correlates with pass@1 on CodeLlama-HumanEval ($ρ{=}0.36$, $p{<}10^{-4}$), Qwen-HumanEval ($ρ{=}0.22$, $p{=}0.005$), and CodeLlama-MBPP ($ρ{=}0.225$, $p{=}0.0038$; all $n{=}164$). Adaptive stress tests with a 23-perturbation pool, a fixed universal suffix, and prompt-embedding PGD all leave positive slack.