SE · May 29

Reassessing Code Authorship Attribution in the Era of Language Models

Atish Kumar Dipongkor, Ziyu Yao, Kevin Moran

arXiv:2506.17120 · 5.23 citations · h-index: 4

Predicted impact: top 76% in SE · last 90 days · Originality: Incremental advance

AI Analysis

This research addresses the need for more robust and effective methods for Code Authorship Attribution, which is crucial for automating software engineering tasks and for cybersecurity applications like plagiarism detection.

This paper investigates the effectiveness of transformer-based Language Models (LMs) for Code Authorship Attribution (CAA), a task for which earlier techniques relied on hand-crafted features. The study applies seven code LMs (two larger and five smaller) to six datasets comprising 12,000 code snippets from 463 developers, and analyzes the models' performance and behavior in understanding stylometric code patterns.

The study of Code Stylometry, and in particular Code Authorship Attribution (CAA), aims to analyze coding styles to identify the authors of code samples. CAA has been shown to be an important component of automating software engineering (SE) tasks such as bug triaging, fault localization, and test prioritization. CAA is also important in cybersecurity and software forensics for addressing copyright disputes and detecting plagiarism. Past techniques for CAA, which tend to leverage hand-crafted code-related features, typically carry limitations that prevent proper authorship characterization and make them sensitive to adversarial attacks. Recently, transformer-based Language Models (LMs) have shown remarkable efficacy across a range of SE tasks, and in authorship attribution for natural language in the NLP domain. However, their effectiveness in CAA is not well understood. As such, we conduct the first extensive empirical study applying two larger state-of-the-art code LMs and five smaller code LMs to the task of CAA on six diverse datasets that encompass 12k code snippets written by 463 developers. Furthermore, we perform an in-depth quantitative and qualitative analysis of our studied models' performance on CAA using established interpretability techniques. Our results illustrate important aspects of the behavior of LMs in understanding stylometric code patterns.
