CRAISESep 2, 2025

Scam2Prompt: A Scalable Framework for Auditing Malicious Scam Endpoints in Production LLMs

arXiv:2509.02372v22 citationsh-index: 3
Originality Highly original
AI Analysis

This addresses a critical security vulnerability in production LLMs that could lead to widespread malicious code generation, with significant implications for software development safety.

The researchers tackled the security risk of LLMs reproducing malicious content by developing Scam2Prompt, a scalable auditing framework that tests whether innocuous prompts trigger malicious code generation. They found that 4.24% of prompts triggered malicious URLs in four production LLMs, and in seven additional LLMs, malicious code generation rates ranged from 12.7% to 43.8%, with existing safety measures detecting less than 0.3% of cases.

Large Language Models (LLMs) have become critical to modern software development, but their reliance on uncurated web-scale datasets for training introduces a significant security risk: the absorption and reproduction of malicious content. To systematically evaluate this risk, we introduce Scam2Prompt, a scalable automated auditing framework that identifies the underlying intent of a scam site and then synthesizes innocuous, developer-style prompts that mirror this intent, allowing us to test whether an LLM will generate malicious code in response to these innocuous prompts. In a large-scale study of four production LLMs (GPT-4o, GPT-4o-mini, Llama-4-Scout, and DeepSeek-V3), we found that Scam2Prompt's innocuous prompts triggered malicious URL generation in 4.24% of cases. To test the persistence of this security risk, we constructed Innoc2Scam-bench, a benchmark of 1,559 innocuous prompts that consistently elicited malicious code from all four initial LLMs. When applied to seven additional production LLMs released in 2025, we found the vulnerability is not only present but severe, with malicious code generation rates ranging from 12.7% to 43.8%. Furthermore, existing safety measures like state-of-the-art guardrails proved insufficient to prevent this behavior, with an overall detection rate of less than 0.3%.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes