SE | AI | Dec 21, 2024

AIGCodeSet: A New Annotated Dataset for AI Generated Code Detection

arXiv:2412.16594v3 | 8 citations | h-index: 13 | SIU
Originality: Synthesis-oriented
AI Analysis

This work addresses ethical issues in software development for educators and employers, but it is incremental as it focuses on creating a new dataset and testing baseline methods.

The authors address the problem of detecting AI-generated code, motivated by ethical concerns in job interviews and student assignments. They present AIGCodeSet, a dataset of 2,828 AI-generated and 4,755 human-written Python code samples, and find that a Bayesian classifier outperforms the other baseline detection methods.

While large language models provide significant convenience for software development, they can lead to ethical issues in job interviews and student assignments. Therefore, determining whether a piece of code was written by a human or generated by an artificial intelligence (AI) model is a critical issue. In this study, we present AIGCodeSet, which consists of 2,828 AI-generated and 4,755 human-written Python code samples, with the AI-generated samples produced by CodeLlama 34B, Codestral 22B, and Gemini 1.5 Flash. In addition, we share the results of our experiments conducted with baseline detection methods. Our experiments show that a Bayesian classifier outperforms the other models.
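To illustrate what a Bayesian baseline for this task can look like, here is a minimal sketch of a multinomial naive Bayes classifier over code tokens. It is not the authors' pipeline: the abstract does not say which features or Bayesian variant they used, so the tokenizer, the class `NaiveBayesCodeDetector`, and the four toy training samples are all assumptions made up for illustration and are not taken from AIGCodeSet.

```python
import math
import re
from collections import Counter


def tokenize(code):
    # Split source text into identifiers, numbers, and single punctuation marks.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\w\s]", code)


class NaiveBayesCodeDetector:
    """Multinomial naive Bayes over code tokens with add-one smoothing.

    Hypothetical illustration only; not the classifier from the paper.
    """

    def fit(self, samples, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.token_counts = {c: Counter() for c in self.classes}
        for code, label in zip(samples, labels):
            self.token_counts[label].update(tokenize(code))
        self.vocab = set()
        for counts in self.token_counts.values():
            self.vocab.update(counts)
        return self

    def log_scores(self, code):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for c in self.classes:
            counts = self.token_counts[c]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[c] / total_docs)  # class prior
            for tok in tokenize(code):
                if tok in self.vocab:  # ignore tokens never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            scores[c] = score
        return scores

    def predict(self, code):
        scores = self.log_scores(code)
        return max(scores, key=scores.get)


# Invented toy samples (NOT from AIGCodeSet), just to exercise the code.
train_samples = [
    'def add(a, b):\n    """Return the sum of a and b."""\n    return a + b',
    'def is_even(n):\n    """Return True if n is even."""\n    return n % 2 == 0',
    "n=int(input())\nprint(n*2)",
    "a,b=map(int,input().split())\nprint(a+b)",
]
train_labels = ["ai", "ai", "human", "human"]

detector = NaiveBayesCodeDetector().fit(train_samples, train_labels)
print(detector.predict("x=int(input())\nprint(x+1)"))
```

On a real dataset, the same interface would be trained on the labeled samples and evaluated on a held-out split. Token counts are only one possible feature choice, and the toy data here says nothing about how such a model performs in practice.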

