Daniel M. Muepu

2 Papers

23.5 SE Mar 26
Error Understanding in Program Code With LLM-DL for Multi-label Classification

Md Faizul Ibne Amin, Yutaka Watanobe, Md. Mostafizer Rahman et al.

Programming is a core skill in computer science and software engineering (SE), yet identifying and resolving code errors remains challenging for both novice and experienced developers. While Large Language Models (LLMs) have shown remarkable capabilities in natural language understanding and generation tasks, their potential in domain-specific, complex scenarios, such as multi-label classification (MLC) of programming errors, remains underexplored. To address this gap, this study proposes a multi-label error classification (MLEC) framework for source code that leverages fine-tuned LLMs, including CodeT5-base, GraphCodeBERT, CodeT5+, UniXcoder, RoBERTa, PLBART, and CoTexT. These LLMs are integrated with deep learning (DL) architectures such as GRU, LSTM, BiLSTM, and BiLSTM with an additive attention mechanism (BiLSTM-A) to capture both syntactic and semantic features from a real-world dataset of errors in student-written Python code. Extensive experiments across 32 model variants, optimized using Optuna-based hyperparameter tuning, were evaluated using comprehensive multi-label metrics, including average accuracy, macro and weighted precision, recall, F1-score, exact match accuracy, One-error, Hamming loss, Jaccard similarity, and ROC-AUC (micro, macro, and weighted). Results show that the CodeT5+_GRU model achieved the strongest performance, with a weighted F1-score of 0.8243, average accuracy of 91.84%, exact match accuracy of 53.78%, Hamming loss of 0.0816, and One-error of 0.0708. These findings confirm the effectiveness of combining pretrained semantic encoders with efficient recurrent decoders. This work lays the foundation for developing intelligent, scalable tools for automated code feedback, with potential applications in programming education (PE) and broader SE domains.
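For readers unfamiliar with the multi-label metrics named in the abstract above, the following is a minimal sketch of four of them (Hamming loss, exact match accuracy, One-error, and sample-averaged Jaccard similarity) using their standard textbook definitions. It is not the authors' evaluation code; the function names and the toy data are illustrative assumptions. Note that the reported average accuracy (0.9184) and Hamming loss (0.0816) sum to 1, which is consistent with average accuracy being computed per label, though the abstract does not state this explicitly.

```python
from typing import Sequence

Labels = Sequence[Sequence[int]]    # one row per sample, one 0/1 entry per label
Scores = Sequence[Sequence[float]]  # one row per sample, one score per label


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / total


def exact_match(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of samples whose whole label vector is predicted exactly."""
    hits = sum(list(t) == list(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def one_error(y_true: Labels, y_score: Scores) -> float:
    """Fraction of samples whose top-scored label is not a true label."""
    misses = 0
    for truth, scores in zip(y_true, y_score):
        top = max(range(len(scores)), key=lambda j: scores[j])
        misses += truth[top] == 0
    return misses / len(y_true)


def jaccard_samples(y_true: Labels, y_pred: Labels) -> float:
    """Mean over samples of |true AND pred| / |true OR pred|."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(truth, pred) if t and p)
        union = sum(1 for t, p in zip(truth, pred) if t or p)
        # Convention assumed here: two empty label sets count as a perfect match.
        total += inter / union if union else 1.0
    return total / len(y_true)


# Toy data (hypothetical): 2 code samples, 3 possible error labels.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0]]
y_score = [[0.9, 0.2, 0.4], [0.1, 0.3, 0.8]]

print(hamming_loss(y_true, y_pred))
print(exact_match(y_true, y_pred))
print(one_error(y_true, y_score))
print(jaccard_samples(y_true, y_pred))
```

Exact match is the strictest of these, since a single wrong label fails the whole sample, which helps explain why it can sit far below per-label accuracy on the same predictions.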

65.2 SE Apr 30
LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

Md Faizul Ibne Amin, Yutaka Watanobe, Daniel M. Muepu et al.

LLMs are increasingly employed both as judges for evaluating open-ended outputs and as co-creation partners in AI-assisted programming, yet rigorous evaluation in human-AI co-creation settings remains underdeveloped, because judgments must be reliable, comparable across models, and interpretable over multi-turn interaction. To address this gap, a rubric-driven LLM-as-a-Judge framework is presented for contest-style human-AI co-creation in coding and software engineering (SE). The framework is built around schema-constrained judge outputs, validation and repair mechanisms, data splits grouped by user and problem to prevent trajectory leakage, and participant-level NONBLIND context. Multiple LLM judges are assessed through a multi-metric protocol covering discrimination (ROC-AUC, PR-AUC), thresholded decision quality (MCC), probabilistic reliability (LogLoss, Brier score, ECE), and inter-judge agreement (Cohen's κ and Fleiss' κ). Human-AI co-creation is further examined through trajectory-level signals, including turn-wise confidence, Success-at-Turn, time-to-success, revision churn, and CodeBLEU. Co-creation success is found to concentrate early, with Success-at-Turn reaching 0.8533 at the first observed turn and stabilizing at 0.8641 by turn 6. Revision behavior, however, remains heterogeneous, suggesting that productive progress can emerge through either incremental refinement or broader restructuring. On the judging side, the best held-out scores reach 0.5937 for ROC-AUC, 0.6904 for PR-AUC, and 0.5000 for MCC, while inter-judge consistency remains modest overall (mean pairwise Cohen's κ = 0.1592, Fleiss' κ = 0.0696). Taken together, this work offers an auditable and reproducible evaluation methodology that links reliability-aware LLM judging with trajectory-based analysis of human-AI co-creation, providing a practical evaluation template for future AI-assisted coding and SE.
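As a rough illustration of the reliability and agreement metrics listed in the abstract above, the sketch below computes the Brier score, a binned expected calibration error (ECE), and Cohen's κ for two judges. It uses the standard definitions, not the paper's implementation. The function names, the bin count, the choice to calibrate the positive-class probability, and the toy verdicts are all assumptions made for the example.

```python
from collections import Counter
from typing import Hashable, Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE on the positive-class probability (one common variant).

    Each equal-width bin contributes |mean predicted prob - observed positive
    rate|, weighted by the share of samples that fall into the bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        pos_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / len(probs)) * abs(mean_prob - pos_rate)
    return ece


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(
        (count_a[k] / n) * (count_b[k] / n) for k in set(count_a) | set(count_b)
    )
    if expected == 1.0:  # both raters always emit the same single label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical): one judge's success probabilities and the true
# outcomes, plus binary verdicts from two judges on the same four items.
probs = [0.9, 0.2, 0.6, 0.4]
labels = [1, 0, 1, 0]
judge_a = [1, 1, 0, 0]
judge_b = [1, 0, 0, 0]

print(brier_score(probs, labels))
print(expected_calibration_error(probs, labels, n_bins=2))
print(cohen_kappa(judge_a, judge_b))
```

Because κ subtracts the agreement expected by chance, two judges can agree on most items and still score low when one verdict dominates, which is worth keeping in mind when reading the modest κ values reported above.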