LG · AI · May 22

PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection

arXiv:2605.24171 · 56.6
Predicted impact: top 41% in LG · last 90 days · Originality: Synthesis-oriented
AI Analysis

For practitioners deploying LLMs for vulnerability detection, this work shows that prompt sensitivity must be explicitly characterized as a system property. The findings are incremental, however: they confirm known issues without introducing a new method.

The paper introduces PromptAudit, a framework for evaluating how different prompting strategies affect LLM-based vulnerability detection. Testing five strategies across five models on 1,000 CVEs, the authors find that chain-of-thought prompting yields the best performance, that few-shot benefits are model-dependent, and that adaptive chain-of-thought and self-consistency degrade performance.

Large language models are increasingly used for vulnerability detection, yet their reliability under different prompt formulations remains uncharacterized. We present PromptAudit, a controlled evaluation framework that isolates prompt effects by fixing the dataset, decoding, and parsing while varying only the prompting strategy. Using five prompting strategies across five open-weight models on 1,000 CVEs (6,074 code samples spanning 16 programming languages), we evaluate accuracy, recall, abstention, coverage, and effective F1. We find that standard chain-of-thought prompting achieves the strongest overall operational performance, while few-shot prompting provides model-dependent benefits that are most pronounced for prompt-sensitive models. In contrast, adaptive chain-of-thought frequently suppresses recall and self-consistency induces excessive abstention, sharply reducing effective performance. These results show that vulnerability detection behavior is jointly determined by the model and the prompt, and that prompt sensitivity is a first-class system property that must be explicitly characterized in evaluation and deployment.
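
The abstract names its metrics (accuracy, recall, abstention, coverage, effective F1) without defining them. The sketch below is one plausible reading, not the paper's implementation. It assumes that a prediction of `None` means the model gave no parseable verdict, that accuracy is computed over answered samples only, and that an abstention on a vulnerable sample counts as a false negative in recall and effective F1. The names `AuditMetrics` and `audit_metrics` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class AuditMetrics:
    accuracy: float      # correct verdicts / answered samples (assumed definition)
    recall: float        # true positives / all vulnerable samples
    abstention: float    # share of samples with no parseable verdict
    coverage: float      # share of samples with a verdict (1 - abstention)
    effective_f1: float  # F1 with abstentions on vulnerable samples as misses (assumed)


def audit_metrics(labels: Sequence[bool],
                  preds: Sequence[Optional[bool]]) -> AuditMetrics:
    """labels: True = vulnerable. preds: True/False verdict, or None for abstention."""
    if len(labels) != len(preds) or len(labels) == 0:
        raise ValueError("labels and preds must be non-empty and of equal length")

    tp = fp = fn = tn = abstained = 0
    for y, p in zip(labels, preds):
        if p is None:
            abstained += 1
            if y:
                fn += 1  # assumption: a skipped vulnerable sample is a miss
        elif p and y:
            tp += 1
        elif p and not y:
            fp += 1
        elif not p and y:
            fn += 1
        else:
            tn += 1

    n = len(labels)
    answered = n - abstained
    accuracy = (tp + tn) / answered if answered else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    effective_f1 = 2 * precision * recall / denom if denom else 0.0

    return AuditMetrics(
        accuracy=accuracy,
        recall=recall,
        abstention=abstained / n,
        coverage=answered / n,
        effective_f1=effective_f1,
    )


# Toy example: six samples, two abstentions.
example = audit_metrics(
    labels=[True, True, True, False, False, False],
    preds=[True, None, False, False, True, None],
)
```

Under these assumptions a strategy that abstains often can keep a respectable accuracy on the samples it answers while its coverage and effective F1 fall, which is the pattern the abstract reports for self-consistency.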

