LG | Jun 18, 2025 | Code
I Know Which LLM Wrote Your Code Last Summer: LLM generated Code Stylometry for Authorship Attribution
Tamas Bisztray, Bilel Cherif, Richard A. Dubniczky et al.
Detecting AI-generated code, deepfakes, and other synthetic content is an emerging research challenge. As code generated by Large Language Models (LLMs) becomes more common, identifying the specific model behind each sample is increasingly important. This paper presents the first systematic study of LLM authorship attribution for C programs. We introduce CodeT5-Authorship, a novel model that uses only the encoder layers of the original CodeT5 encoder-decoder architecture, discarding the decoder to focus on classification. The encoder output at the first token is passed through a two-layer classification head with GELU activation and dropout, producing a probability distribution over possible authors. To evaluate our approach, we introduce LLM-AuthorBench, a benchmark of 32,000 compilable C programs generated by eight state-of-the-art LLMs across diverse tasks. We compare our model to seven traditional ML classifiers and eight fine-tuned transformer models: BERT, RoBERTa, CodeBERT, ModernBERT, DistilBERT, DeBERTa-V3, Longformer, and LoRA-fine-tuned Qwen2-1.5B. In binary classification, our model achieves 97.56% accuracy in distinguishing C programs generated by closely related models such as GPT-4.1 and GPT-4o, and it reaches 95.40% accuracy for multi-class attribution among five leading LLMs (Gemini 2.5 Flash, Claude 3.5 Haiku, GPT-4.1, Llama 3.3, and DeepSeek-V3). To support open science, we release the CodeT5-Authorship architecture, the LLM-AuthorBench benchmark, and all relevant Google Colab scripts on GitHub: https://github.com/LLMauthorbench/.
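The classification head described in the abstract can be illustrated with a small, framework-free sketch. This is not the authors' implementation: the layer sizes (`HIDDEN_DIM`, `HEAD_DIM`, `NUM_AUTHORS`), the random weights, the random stand-in for the encoder output, and the ordering of GELU before dropout are all assumptions made for illustration. A real model would take the first-token hidden state from the CodeT5 encoder and use a deep-learning framework.

```python
import math
import random


def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weights, bias):
    # weights holds one row per output unit.
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def dropout(vec, p, training, rng):
    # Inverted dropout: active only during training, identity at inference.
    if not training or p == 0.0:
        return list(vec)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in vec]


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_linear(n_in, n_out, rng):
    # Small uniform random weights and zero biases (illustrative only).
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def classify(encoder_states, head, p_drop=0.1, training=False, rng=None):
    # Two-layer head applied to the first token's encoder output.
    rng = rng or random.Random(0)
    first_token = encoder_states[0]
    hidden = linear(first_token, head["w1"], head["b1"])
    # Assumed ordering: GELU, then dropout.
    hidden = dropout([gelu(h) for h in hidden], p_drop, training, rng)
    logits = linear(hidden, head["w2"], head["b2"])
    return softmax(logits)


rng = random.Random(42)
HIDDEN_DIM, HEAD_DIM, NUM_AUTHORS = 16, 8, 5  # hypothetical sizes

w1, b1 = init_linear(HIDDEN_DIM, HEAD_DIM, rng)
w2, b2 = init_linear(HEAD_DIM, NUM_AUTHORS, rng)
head = {"w1": w1, "b1": b1, "w2": w2, "b2": b2}

# Random stand-in for encoder output: 4 token positions x HIDDEN_DIM features.
encoder_states = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)]
                  for _ in range(4)]

probs = classify(encoder_states, head)
predicted_author = max(range(NUM_AUTHORS), key=lambda i: probs[i])
```

Here `probs` is a probability distribution over the five hypothetical author classes, and `predicted_author` is the index of the most likely one. With `training=False`, dropout is a no-op, so repeated calls return the same distribution.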
CR | Oct 14, 2021
Privacy Impact Assessment: Comparing methodologies with a focus on practicality
Tamas Bisztray, Nils Gruschka
Privacy and data protection have become increasingly important in recent years, since a growing number of enterprises and startups harvest personal data as part of their business model. One central requirement of the GDPR is the implementation of a data protection impact assessment for privacy-critical systems. However, the law does not dictate or recommend the use of any particular framework. In this paper we compare different data protection impact assessment frameworks. We have developed a comparison and evaluation methodology and applied it to three popular impact assessment frameworks. The comparison shows the strengths and weaknesses of each framework, and it clearly indicates that none of the tested frameworks fulfills all desired properties. Thus, the development of a new or improved data protection impact assessment framework is an important open issue for future work, especially for sector-specific applications.
CR | Nov 20, 2018
Privacy Issues and Data Protection in Big Data: A Case Study Analysis under GDPR
Nils Gruschka, Vasileios Mavroeidis, Kamer Vishi et al.
Big data has become a great asset for many organizations, promising improved operations and new business opportunities. However, big data has increased access to sensitive information that, when processed, can directly jeopardize the privacy of individuals and violate data protection laws. As a consequence, data controllers and data processors may face severe penalties for non-compliance, which can even result in bankruptcy. In this paper, we discuss the current state of the legal regulations and analyze different data protection and privacy-preserving techniques in the context of big data analysis. In addition, we present and analyze two real-life research projects as case studies that deal with sensitive data and the actions taken to comply with data protection laws. We show which types of information might become a privacy risk, which privacy-preserving techniques were employed in accordance with the legal requirements, and how these techniques influence the data processing phase and the research results.