STRATA: Simple, Gradient-Free Attacks for Models of Code
This work addresses the challenge of creating function-preserving adversarial attacks for models of code, which matters for security testing in software engineering.
The authors generate adversarial examples for source code models by exploiting a relationship between token frequency and the L2 norm of learned token embeddings. The resulting gradient-free method outperforms gradient-based approaches while using less information and less computation.
Neural networks are well known to be vulnerable to imperceptible perturbations of the input, called adversarial examples, that result in misclassification. Generating adversarial examples for source code poses an additional challenge compared to the domains of images and natural language, because source code perturbations must retain the functional meaning of the code. We identify a striking relationship between token frequency statistics and learned token embeddings: the L2 norm of learned token embeddings increases with the frequency of the token, except for the highest-frequency tokens. We leverage this relationship to construct a simple and efficient gradient-free method for generating state-of-the-art adversarial examples on models of code. Our method empirically outperforms competing gradient-based methods with less information and less computational effort.
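The idea of using token frequency as a gradient-free stand-in for embedding norm can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the relationship is turned into an attack, so the sketch assumes the perturbation is a consistent renaming of a local identifier to a frequent identifier. The toy corpus, the `skip_top` parameter (a hypothetical way to drop the highest-frequency tokens, where the stated trend no longer holds), and all function names are illustrative.

```python
import keyword
import math
from collections import Counter


def l2_norm(vector):
    """Euclidean (L2) norm of one embedding vector, the statistic the abstract refers to."""
    return math.sqrt(sum(x * x for x in vector))


def identifier_frequencies(corpus):
    """Count identifier tokens (no keywords, no punctuation) over a tokenized corpus."""
    counts = Counter()
    for tokens in corpus:
        counts.update(
            t for t in tokens if t.isidentifier() and not keyword.iskeyword(t)
        )
    return counts


def ranked_candidates(counts, skip_top=0):
    """Identifiers by descending frequency (ties broken alphabetically).

    Assumption: frequency stands in for embedding norm, so no gradients or
    model weights are needed. `skip_top` is a hypothetical knob, not a
    parameter taken from the paper.
    """
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return ranked[skip_top:]


def rename_identifier(tokens, old_name, candidates):
    """Rename every occurrence of `old_name` to the first unused candidate.

    Assumption: `old_name` is a local variable, so a consistent rename to a
    name that does not already occur in the snippet leaves behavior unchanged.
    """
    used = set(tokens)
    for new_name in candidates:
        if new_name not in used:
            return [new_name if t == old_name else t for t in tokens]
    return list(tokens)  # no safe candidate: leave the snippet untouched


# Toy tokenized corpus (illustrative only).
CORPUS = [
    "def add ( x , y ) : return x + y".split(),
    "def get ( data , index ) : return data [ index ]".split(),
    "def size ( data ) : return len ( data )".split(),
]

counts = identifier_frequencies(CORPUS)
candidates = ranked_candidates(counts)

snippet = "def f ( tmp ) : return tmp * 2".split()
perturbed = rename_identifier(snippet, "tmp", candidates)
print(" ".join(perturbed))
```

With access to a model's embedding table, `l2_norm` could be applied to each token's vector and compared against `counts` to check whether the reported trend holds for that model. The sketch itself needs only corpus statistics.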