CL | AI | CR | LG | Jul 25, 2025

MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?

arXiv:2507.19598v1 | 4 citations | h-index: 18 | EMNLP
Originality: Highly original
AI Analysis

This work addresses security risks for users of code generation models by exposing and mitigating the models' vulnerability to adversarial misuse.

The paper tackles the vulnerability of code language models to multi-turn malicious coding prompts that evade safety filters, and finds that fine-tuning on its benchmark, MOCHA, improves rejection rates on external datasets by up to 32.4%.

Recent advancements in Large Language Models (LLMs) have significantly enhanced their code generation capabilities. However, their robustness against adversarial misuse, particularly through multi-turn malicious coding prompts, remains underexplored. In this work, we introduce code decomposition attacks, where a malicious coding task is broken down into a series of seemingly benign subtasks across multiple conversational turns to evade safety filters. To facilitate systematic evaluation, we introduce MOCHA, a large-scale benchmark designed to evaluate the robustness of code LLMs against both single-turn and multi-turn malicious prompts. Empirical results across open- and closed-source models reveal persistent vulnerabilities, especially under multi-turn scenarios. Fine-tuning on MOCHA improves rejection rates while preserving coding ability and, importantly, enhances robustness on external adversarial datasets, with up to a 32.4% increase in rejection rates without any additional supervision.
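The abstract reports results as rejection rates over single-turn and multi-turn prompts. As a rough illustration of how such a metric could be tallied, the Python sketch below computes a per-category rejection rate from recorded model responses. Everything in it is an assumption made for illustration and not taken from the paper: the `Conversation` record, the keyword-based `is_refusal` check (a real evaluation would likely use a stronger judge), and the rule that a multi-turn conversation counts as rejected if the model refuses at any turn. The responses are benign placeholders.

```python
from dataclasses import dataclass

# Hypothetical refusal markers, chosen only for this sketch.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")


@dataclass
class Conversation:
    kind: str             # "single" or "multi"
    responses: list[str]  # one recorded model response per turn


def is_refusal(response: str) -> bool:
    """Return True if the response contains any refusal marker."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def is_rejected(conv: Conversation) -> bool:
    # Assumption: a conversation is rejected if the model refuses at any turn.
    return any(is_refusal(r) for r in conv.responses)


def rejection_rate(convs: list[Conversation], kind: str) -> float:
    """Fraction of conversations of the given kind that were rejected."""
    subset = [c for c in convs if c.kind == kind]
    if not subset:
        return 0.0
    return sum(is_rejected(c) for c in subset) / len(subset)


# Placeholder data standing in for recorded model outputs.
convs = [
    Conversation("single", ["I can't help with that."]),
    Conversation("single", ["Here is the code."]),
    Conversation("multi", ["Sure, step one.", "Sure, step two."]),
    Conversation("multi", ["Sure.", "I cannot continue with this."]),
    Conversation("multi", ["Done.", "Done.", "Done."]),
]

single_rate = rejection_rate(convs, "single")
multi_rate = rejection_rate(convs, "multi")
print(f"single-turn rejection rate: {single_rate:.2f}")
print(f"multi-turn rejection rate:  {multi_rate:.2f}")
```

Comparing the two rates on the same underlying tasks is one way to surface the gap the abstract describes, where models that refuse a direct request may still comply when the task is spread over several turns.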

