Summit Haque

SE
h-index: 4
3 papers
2 citations
Novelty: 33%
AI Score: 39

3 Papers

45.2 SE May 11
Code Comprehension with GitHub Copilot: Performance Gains, Comprehension Trade-offs, and Behavioral Predictors in Brownfield Programming

Yunhan Qiao, Md Istiak Hossain Shihab, Summit Haque et al.

Teaching Computer Science (CS) students how to comprehend and maintain legacy codebases is a critical challenge in software engineering education. While Generative AI (GenAI) assistants like GitHub Copilot improve task completion speed and correctness, their impact on code understanding remains unclear. We conducted a within-subject study with 15 graduate CS students completing feature implementation tasks with and without Copilot. Despite significant performance improvements, participants showed no overall comprehension improvement (p = 0.59), revealing a comprehension-performance decoupling. Further analysis uncovered a comprehension trade-off: performance gains correlated negatively with reverse engineering comprehension (ρ = -0.57, p = 0.026) but showed a positive trend with implementation comprehension (ρ = 0.50, p = 0.06). A follow-up behavioral analysis revealed that how students used Copilot determined outcomes: engaging in verification loops, in which programmers actively reviewed generated code, strongly predicted comprehension (p < 0.001, r = 0.96), with high-comprehension participants verifying code 4.7 times more frequently than low-comprehension participants. These findings suggest that GenAI tools do not inherently undermine comprehension; rather, passive consumption patterns do. They point to a need to teach system-level verification skills in programming education and to redesign educational GenAI tools to scaffold active cognitive engagement.
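The ρ values reported in this abstract are rank correlations. As a rough illustration of how such a statistic is computed, the sketch below implements Spearman's ρ as the Pearson correlation of average ranks. The function names and the toy numbers are hypothetical and are not taken from the study; it is a minimal sketch, not the authors' analysis code.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j all receive this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (made up for illustration, NOT the study's measurements):
# larger performance gains paired with lower comprehension scores.
performance_gain = [1, 2, 3, 4, 5]
comprehension = [5, 4, 3, 2, 1]
rho = spearman_rho(performance_gain, comprehension)
```

A negative ρ, as in the toy data, means that participants who rank higher on one measure tend to rank lower on the other; it says nothing about causation.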

SE Jan 27
Whitespaces Don't Lie: Feature-Driven and Embedding-Based Approaches for Detecting Machine-Generated Code

Syed Mehedi Hasan Nirob, Shamim Ehsan, Moqsadur Rahman et al.

Large language models (LLMs) have made it remarkably easy to synthesize plausible source code from natural language prompts. While this accelerates software development and supports learning, it also raises new risks for academic integrity, authorship attribution, and responsible AI use. This paper investigates the problem of distinguishing human-written from machine-generated code by comparing two complementary approaches: feature-based detectors built from lightweight, interpretable stylometric and structural properties of code, and embedding-based detectors leveraging pretrained code encoders. Using a recent large-scale benchmark dataset of 600k human-written and AI-generated code samples, we find that feature-based models achieve strong performance (ROC-AUC 0.995, PR-AUC 0.995, F1 0.971), while embedding-based models with CodeBERT embeddings are also very competitive (ROC-AUC 0.994, PR-AUC 0.994, F1 0.965). Analysis shows that features tied to indentation and whitespace provide particularly discriminative cues, whereas embeddings capture deeper semantic patterns and yield slightly higher precision. These findings underscore the trade-offs between interpretability and generalization, offering practical guidance for deploying robust code-origin detection in academic and industrial contexts.
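The abstract singles out indentation and whitespace features as discriminative but does not list them. The sketch below shows what a few lightweight features of that kind might look like. The function name `whitespace_features` and the specific features are assumptions chosen for illustration, not the paper's feature set.

```python
def whitespace_features(source: str) -> dict:
    """Compute a few illustrative whitespace/indentation statistics.

    These features are hypothetical examples; the paper's actual
    feature definitions are not given in the abstract.
    """
    lines = source.splitlines()
    non_blank = [ln for ln in lines if ln.strip()]
    # Leading whitespace width per non-blank line (a tab counts as 1 char).
    indents = [len(ln) - len(ln.lstrip(" \t")) for ln in non_blank]
    n_lines = len(lines)
    n_non_blank = len(non_blank)
    n_chars = len(source)
    return {
        "blank_line_ratio": (n_lines - n_non_blank) / n_lines if n_lines else 0.0,
        "mean_indent": sum(indents) / n_non_blank if n_non_blank else 0.0,
        "tab_indented_ratio": (
            sum(ln.startswith("\t") for ln in non_blank) / n_non_blank
            if n_non_blank else 0.0
        ),
        "trailing_ws_ratio": (
            sum(ln != ln.rstrip() for ln in lines) / n_lines if n_lines else 0.0
        ),
        "whitespace_char_ratio": (
            sum(ch.isspace() for ch in source) / n_chars if n_chars else 0.0
        ),
    }


# Small sample: one indented line, one blank line, one line with trailing spaces.
sample = "def f(x):\n    return x + 1\n\nprint(f(2))  \n"
features = whitespace_features(sample)
```

A feature dictionary like this could be passed to any tabular classifier; which classifier the authors used is not stated in the abstract.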

CV Jan 27
Handcrafted Feature Fusion for Reliable Detection of AI-Generated Images

Syed Mehedi Hasan Nirob, Moqsadur Rahman, Shamim Ehsan et al.

The rapid progress of generative models has enabled the creation of highly realistic synthetic images, raising concerns about authenticity and trust in digital media. Detecting such fake content reliably is an urgent challenge. While deep learning approaches dominate current literature, handcrafted features remain attractive for their interpretability, efficiency, and generalizability. In this paper, we conduct a systematic evaluation of handcrafted descriptors, including raw pixels, color histograms, Discrete Cosine Transform (DCT), Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP), Gray-Level Co-occurrence Matrix (GLCM), and wavelet features, on the CIFAKE dataset of real versus synthetic images. Using 50,000 training and 10,000 test samples, we benchmark seven classifiers ranging from Logistic Regression to advanced gradient-boosted ensembles (LightGBM, XGBoost, CatBoost). Results demonstrate that LightGBM consistently outperforms alternatives, achieving PR-AUC 0.9879, ROC-AUC 0.9878, F1 0.9447, and a Brier score of 0.0414 with mixed features, representing strong gains in calibration and discrimination over simpler descriptors. Across three configurations (baseline, advanced, mixed), performance improves monotonically, confirming that combining diverse handcrafted features yields substantial benefit. These findings highlight the continued relevance of carefully engineered features and ensemble learning for detecting synthetic images, particularly in contexts where interpretability and computational efficiency are critical.
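Of the descriptors listed, Local Binary Patterns (LBP) are simple enough to sketch in a few lines. The example below computes a basic 8-neighbor LBP histogram for a grayscale image stored as a list of lists. The neighbor ordering and the `>=` comparison are one common convention and an assumption here; the abstract does not say which LBP variant or parameters the paper used.

```python
# Clockwise neighbor offsets starting at the top-left pixel.
# The order is arbitrary but must stay fixed across images.
NEIGHBOR_OFFSETS = [
    (-1, -1), (-1, 0), (-1, 1), (0, 1),
    (1, 1), (1, 0), (1, -1), (0, -1),
]


def lbp_code(image, row, col):
    """8-bit LBP code: bit k is set if neighbor k is >= the center pixel."""
    center = image[row][col]
    code = 0
    for bit, (d_row, d_col) in enumerate(NEIGHBOR_OFFSETS):
        if image[row + d_row][col + d_col] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Normalized 256-bin histogram of LBP codes over interior pixels."""
    hist = [0] * 256
    rows, cols = len(image), len(image[0])
    # Border pixels are skipped because they lack a full 3x3 neighborhood.
    for row in range(1, rows - 1):
        for col in range(1, cols - 1):
            hist[lbp_code(image, row, col)] += 1
    total = sum(hist)
    return [count / total for count in hist] if total else hist


# Tiny hypothetical grayscale patch: bright top row, darker elsewhere.
patch = [
    [9, 9, 9],
    [1, 5, 1],
    [1, 1, 1],
]
descriptor = lbp_histogram(patch)
```

The resulting 256-value vector can be concatenated with other descriptors (for example, color histograms) before training a classifier, which is the kind of feature fusion the abstract describes.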