CR · AI · Feb 1

CIPHER: Cryptographic Insecurity Profiling via Hybrid Evaluation of Responses

Max Manolov, Tony Gao, Siddharth Shukla, Cheng-Ting Chou, Ryan Lagasse

arXiv:2602.01438v1

Originality: Incremental advance

AI Analysis

The paper addresses security risks for developers who use LLMs for code generation, but the advance is incremental because it builds on existing benchmarking and evaluation methods.

The paper tackles exploitable cryptographic flaws in LLM-generated Python code by introducing CIPHER, a benchmark that measures vulnerability incidence under controlled security-guidance conditions. It finds that explicit secure prompting reduces some issues but does not reliably eliminate vulnerabilities overall.

Large language models (LLMs) are increasingly used to assist developers with code, yet their implementations of cryptographic functionality often contain exploitable flaws. Minor design choices (e.g., static initialization vectors or missing authentication) can silently invalidate security guarantees. We introduce CIPHER (Cryptographic Insecurity Profiling via Hybrid Evaluation of Responses), a benchmark for measuring cryptographic vulnerability incidence in LLM-generated Python code under controlled security-guidance conditions. CIPHER uses insecure/neutral/secure prompt variants per task, a cryptography-specific vulnerability taxonomy, and line-level attribution via an automated scoring pipeline. Across a diverse set of widely used LLMs, we find that explicit "secure" prompting reduces some targeted issues but does not reliably eliminate cryptographic vulnerabilities overall. The benchmark and reproducible scoring pipeline will be publicly released upon publication.
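
To make "line-level attribution" concrete, here is a minimal sketch of a scanner that reports which line of a generated snippet triggers which rule. It is not the CIPHER pipeline, which the abstract does not describe in detail. The rule names, the `scan_source` function, and the two heuristics (a bytes literal assigned to a variable named like an IV or nonce, and any use of a `MODE_ECB` constant) are assumptions chosen for illustration, and a real taxonomy would be much broader.

```python
import ast

# Hypothetical rule labels; the paper's actual taxonomy is not given in the abstract.
STATIC_IV = "static-iv"
ECB_MODE = "ecb-mode"
IV_LIKE_TOKENS = {"iv", "nonce"}


def scan_source(source: str) -> list[tuple[int, str]]:
    """Return sorted (line_number, rule) findings for a Python source string."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Rule 1: a variable whose name contains the token "iv" or "nonce"
        # is assigned a constant bytes literal (a fixed, reusable IV).
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant):
            if isinstance(node.value.value, bytes):
                for target in node.targets:
                    if isinstance(target, ast.Name):
                        tokens = set(target.id.lower().split("_"))
                        if tokens & IV_LIKE_TOKENS:
                            findings.append((node.lineno, STATIC_IV))
        # Rule 2: any attribute access named MODE_ECB (PyCryptodome-style constant).
        if isinstance(node, ast.Attribute) and node.attr == "MODE_ECB":
            findings.append((node.lineno, ECB_MODE))
    return sorted(findings)


# Illustrative stand-in for an LLM response; it is parsed, never executed,
# so the Crypto package does not need to be installed.
SAMPLE = '''\
from Crypto.Cipher import AES

key = b"0123456789abcdef"
iv = b"0000000000000000"
cipher = AES.new(key, AES.MODE_ECB)
'''

if __name__ == "__main__":
    for line_number, rule in scan_source(SAMPLE):
        print(f"line {line_number}: {rule}")
```

Run on the sample, the sketch attributes `static-iv` to line 4 and `ecb-mode` to line 5. Parsing with `ast` rather than matching raw text keeps comments and string contents from triggering rules, at the cost of missing flaws that only show up through data flow.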
