SE CL Apr 28, 2025

Evaluate-and-Purify: Fortifying Code Language Models Against Adversarial Attacks Using LLM-as-a-Judge

Wenhan Mu, Ling Xu, Shuren Pei, Le Mi, Huichi Zhou

arXiv:2504.19730v111.37 citations h-index: 2

Originality: Incremental advance

AI Analysis

This work addresses security risks in software engineering tools that rely on code language models and offers a practical defense against adversarial attacks, though the advance is incremental because it builds on existing attack-detection insights.

The paper tackles the vulnerability of code language models to adversarial attacks, specifically identifier substitution attacks, by proposing EP-Shield, a framework that evaluates and purifies adversarial code using LLM-as-a-Judge, achieving up to 83.36% improvement over adversarial fine-tuning with a lightweight 7B-parameter design.

The widespread adoption of code language models in software engineering tasks has exposed vulnerabilities to adversarial attacks, especially identifier substitution attacks. Although existing identifier substitution attackers demonstrate high success rates, they often produce adversarial examples with unnatural code patterns. In this paper, we systematically assess the quality of adversarial examples using LLM-as-a-Judge. Our analysis reveals that over 80% of adversarial examples generated by state-of-the-art identifier substitution attackers (e.g., ALERT) are actually detectable. Based on this insight, we propose EP-Shield, a unified framework for evaluating and purifying identifier substitution attacks via naturalness-aware reasoning. Specifically, we first evaluate the naturalness of code and identify the perturbed adversarial code, then purify it so that the victim model can restore the correct prediction. Extensive experiments demonstrate the superiority of EP-Shield over adversarial fine-tuning (up to 83.36% improvement) and its lightweight design (7B parameters) with GPT-4-level performance.
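The evaluate-then-purify flow described in the abstract can be illustrated with a minimal sketch. This is not EP-Shield's implementation: the paper uses an LLM judge, whereas the vowel-free-identifier heuristic below is a toy stand-in, and every name here (`toy_judge`, `toy_purify`, `evaluate_and_purify`, the `threshold` value) is a hypothetical chosen for illustration.

```python
import re

# Matches identifier-like tokens (keywords included; fine for a toy example).
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def looks_unnatural(name: str) -> bool:
    # Toy assumption: a name longer than 3 characters with no vowels is "unnatural".
    return len(name) > 3 and re.search(r"[aeiouAEIOU]", name) is None


def toy_judge(code: str) -> float:
    # Stand-in for the LLM judge: returns a naturalness score in [0, 1].
    names = set(IDENT.findall(code))
    if not names:
        return 1.0
    bad = sum(looks_unnatural(n) for n in names)
    return 1.0 - bad / len(names)


def toy_purify(code: str) -> str:
    # Stand-in purifier: consistently renames unnatural identifiers to neutral ones.
    mapping = {}

    def rename(match):
        name = match.group(0)
        if not looks_unnatural(name):
            return name
        if name not in mapping:
            mapping[name] = f"var_{len(mapping)}"
        return mapping[name]

    return IDENT.sub(rename, code)


def evaluate_and_purify(code, judge, purify, victim, threshold=0.9):
    # Evaluate naturalness first; purify only inputs flagged as likely perturbed,
    # then hand the (possibly purified) code to the victim model.
    if judge(code) < threshold:
        code = purify(code)
    return victim(code)
```

Passing the judge, purifier, and victim model as callables keeps the sketch independent of any particular model; in practice each would be replaced by a real model call.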
