Eray Tüzün

SE
3 papers
Novelty: 37%
AI Score: 38

3 Papers

SE, Mar 26
LACY: Simulating Expert Mentoring for Software Onboarding with Code Tours

Zeynep Begüm Kara, Aytekin İsmail, Ece Ateş et al.

Every software organization faces the onboarding challenge: helping newcomers navigate complex codebases, compensate for insufficient documentation, and comprehend code they did not author. Expert walkthroughs are among the most effective forms of support, yet they are expensive, repetitive, and do not scale. We present Lacy, a hybrid human-AI onboarding system that captures expert mentoring in reusable code tours. To our knowledge, it is the first hybrid approach combining AI-generated content with expert curation in code tours. Our design is grounded in requirements derived from 20+ meetings, surveys, and interviews across a year-long industry partnership with Beko. Supporting features include Voice-to-Tour capture, comprehension quizzes, podcasts, and a dashboard. We deployed Lacy in Beko's production environment and conducted a controlled study on a legacy finance system (30K+ LOC). Learners using expert-guided tours scored 83% on quizzes versus 57% for those using AI-only tours, preferred tours over traditional self-study, and reported that they would need fewer expert consultations. Experts found tour creation less burdensome than live walkthroughs. Beko has since adopted Lacy for organizational onboarding, and we release our code and study instruments as a replication package.

SE, Mar 26
Factors Influencing the Quality of AI-Generated Code: A Synthesis of Empirical Evidence

Vehid Geruslu, Zulfiyya Aliyeva, Eray Tüzün

Context: The rapid adoption of AI-assisted code generation tools, such as large language models (LLMs), is transforming software development practices. While these tools promise significant productivity gains, concerns regarding the quality, reliability, and security of AI-generated code are increasingly reported in both academia and industry.
Objective: This study aims to systematically synthesize existing empirical evidence on the factors influencing the quality of AI-generated source code and to analyze how these factors impact software quality outcomes across different evaluation contexts.
Method: We conducted a systematic literature review (SLR) following established guidelines, supported by an AI-assisted workflow with human oversight. A total of 24 primary studies were selected through a structured search and screening process across major digital libraries. Data were extracted and analyzed using qualitative, pattern-based evidence synthesis.
Results: The findings reveal that code quality in AI-assisted development is influenced by a combination of human factors, AI system characteristics, and human-AI interaction dynamics. Key influencing factors include prompt design, task specification, and developer expertise. The results also show variability in quality outcomes such as correctness, security, maintainability, and complexity across studies, with both improvements and risks reported.
Conclusion: AI-assisted code generation represents a socio-technical shift in software engineering, where achieving high-quality outcomes depends on both technological and human factors. While promising, AI-generated code requires careful validation and integration into development workflows.

SE, Apr 1
SERSEM: Selective Entropy-Weighted Scoring for Membership Inference in Code Language Models

Kıvanç Kuzey Dikici, Serdar Kara, Semih Çağlar et al.

As Large Language Models (LLMs) for code increasingly utilize massive, often non-permissively licensed datasets, evaluating data contamination through Membership Inference Attacks (MIAs) has become critical. We propose SERSEM (Selective Entropy-Weighted Scoring for Membership Inference), a novel white-box attack framework that suppresses uninformative syntactical boilerplate to amplify specific memorization signals. SERSEM utilizes a dual-signal methodology: first, a continuous character-level weight mask is derived through static Abstract Syntax Tree (AST) analysis, spellchecking-based multilingual logic detection, and offline linting. Second, these heuristic weights are used to pool internal transformer activations and calibrate token-level Z-scores from the output logits. Evaluated on a 25,000-sample balanced dataset, SERSEM achieves a global AUC-ROC of 0.7913 on the StarCoder2-3B model and 0.7867 on the StarCoder2-7B model, consistently outperforming the implemented probability-based baselines Loss, Min-K% Prob, and PAC. Our findings demonstrate that focusing on human-centric coding anomalies provides a significantly more robust indicator of verbatim memorization than sequence-level probability averages.
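
To make the scoring ideas in this abstract concrete, the following is a minimal Python sketch of two of the probability-based baselines it names (Loss and Min-K% Prob), a weighted pooling of token log-probabilities, and a pairwise AUC-ROC computation. It is not the authors' implementation: the function names, the `k` default, and the example weights are assumptions chosen for illustration, and the weighted pooling only gestures at the idea of down-weighting boilerplate tokens, not at SERSEM's actual AST, linting, or activation-based signals.

```python
from statistics import mean


def loss_score(token_logprobs):
    """Loss baseline: mean negative log-likelihood of the sequence.

    A lower loss is usually read as weak evidence of membership.
    """
    return -mean(token_logprobs)


def min_k_prob(token_logprobs, k=0.2):
    """Min-K% Prob baseline: mean log-probability of the k fraction of
    least likely tokens. A higher value suggests membership."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return mean(lowest)


def weighted_score(token_logprobs, weights):
    """Hypothetical weighted pooling (an assumption, not SERSEM itself):
    weights in [0, 1] shrink the contribution of boilerplate tokens."""
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return sum(w * lp for w, lp in zip(weights, token_logprobs)) / total


def auc_roc(member_scores, nonmember_scores):
    """AUC-ROC as the probability that a random member outscores a
    random non-member, with ties counted as one half."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Made-up per-token log-probabilities and weights, for illustration only.
logprobs = [-0.1, -0.2, -3.0, -0.05, -2.5]
weights = [0.1, 0.1, 1.0, 0.1, 1.0]

print(loss_score(logprobs))
print(min_k_prob(logprobs, k=0.4))
print(weighted_score(logprobs, weights))
print(auc_roc([-1.0, -0.5], [-2.0, -0.7]))
```

In this sketch, an attack is evaluated by computing one score per sample and passing the member and non-member scores to `auc_roc`; a value of 0.5 corresponds to chance, which puts the reported AUC-ROC figures of 0.7913 and 0.7867 in context.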