SE, Jan 30, 2020

Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering

arXiv:2001.11593v2, 49 citations
AI Analysis

This paper addresses the detection of plagiarized code and the prevention of legal issues in software engineering. Its contribution is incremental: it builds on existing research, with a focus on practical applicability.

The paper tackles authorship attribution of source code by introducing a language-agnostic approach and highlighting limitations in existing datasets, showing that high accuracy on synthetic data drops significantly on more realistic data.

Authorship attribution (i.e., determining who is the author of a piece of source code) is an established research topic. State-of-the-art results for the authorship attribution problem look promising for the software engineering field, where they could be applied to detect plagiarized code and prevent legal issues. With this article, we first introduce a new language-agnostic approach to authorship attribution of source code. Then, we discuss limitations of existing synthetic datasets for authorship attribution, and propose a data collection approach that delivers datasets that better reflect aspects important for potential practical use in software engineering. Finally, we demonstrate that high accuracy of authorship attribution models on existing datasets drastically drops when they are evaluated on more realistic data. We outline next steps for the design and evaluation of authorship attribution models that could bring the research efforts closer to practical use for software engineering.
