Shazibul Islam Shamim

SE
3 papers
Novelty: 48%
AI Score: 48

3 Papers

30.6 SE Mar 29 Code
Safer Builders, Risky Maintainers: A Comparative Study of Breaking Changes in Human vs Agentic PRs

K M Ferdous, Dipayan Banik, Kowshik Chowdhury et al.

AI coding agents are increasingly integrated into modern software engineering workflows, actively collaborating with human developers to create pull requests (PRs) in open-source repositories. Although coding agents improve developer productivity, they often generate code with more bugs and security issues than human-authored code. While human-authored PRs often break backward compatibility, leading to breaking changes, the potential for agentic PRs to introduce breaking changes remains underexplored. The goal of this paper is to help developers and researchers evaluate the reliability of AI-generated PRs by examining the frequency and task contexts in which AI agents introduce breaking changes. We compare 7,191 agent-generated PRs with 1,402 human-authored PRs from Python repositories in the AIDev dataset. We develop a tool that analyzes code changes in commits corresponding to the agentic PRs and uses abstract syntax tree (AST) analysis to detect potential breaking changes. Our findings show that AI agents introduce fewer breaking changes overall than humans (3.45% vs. 7.40%) in code generation tasks. However, agents exhibit substantially higher risk during maintenance tasks, with refactoring and chore changes introducing breaking changes at rates of 6.72% and 9.35%, respectively. We also identify a "Confidence Trap" where highly confident agentic PRs still introduce breaking changes, indicating the need for stricter review during maintenance-oriented changes regardless of the reported confidence score.
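To make the AST-based detection idea concrete, here is a minimal sketch, not the authors' tool. It assumes a breaking change means one of three things: a removed public top-level function, a removed positional parameter, or a newly added parameter without a default. The names `public_signatures` and `find_breaking_changes`, and the `OLD`/`NEW` snippets, are hypothetical.

```python
import ast

def public_signatures(source):
    """Map each public top-level function to (parameter names, required parameter names)."""
    sigs = {}
    for node in ast.parse(source).body:
        is_func = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_func and not node.name.startswith("_"):
            params = [a.arg for a in node.args.posonlyargs + node.args.args]
            # Defaults align with the trailing positional parameters.
            n_required = len(params) - len(node.args.defaults)
            sigs[node.name] = (params, set(params[:n_required]))
    return sigs

def find_breaking_changes(old_source, new_source):
    """Report API changes (under the assumed definition) that could break callers."""
    old, new = public_signatures(old_source), public_signatures(new_source)
    issues = []
    for name, (old_params, _) in sorted(old.items()):
        if name not in new:
            issues.append(f"removed function: {name}")
            continue
        new_params, new_required = new[name]
        removed = [p for p in old_params if p not in new_params]
        if removed:
            issues.append(f"{name}: removed parameters {removed}")
        added_required = sorted(new_required - set(old_params))
        if added_required:
            issues.append(f"{name}: new required parameters {added_required}")
    return issues

# Hypothetical before/after versions of a module touched by a PR.
OLD = "def load(path, mode='r'):\n    pass\n\ndef save(path):\n    pass\n"
NEW = "def load(path, encoding):\n    pass\n"

for issue in find_breaking_changes(OLD, NEW):
    print(issue)
```

This sketch ignores classes, methods, keyword-only parameters, and re-exports, all of which a fuller analysis would likely need to cover.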

35.7 SE Apr 3 Code
From Industry Claims to Empirical Reality: An Empirical Study of Code Review Agents in Pull Requests

Kowshik Chowdhury, Dipayan Banik, K M Ferdous et al.

Autonomous coding agents are generating code at an unprecedented scale, with OpenAI Codex alone creating over 400,000 pull requests (PRs) in two months. As agentic PR volumes increase, code review agents (CRAs) have become routine gatekeepers in development workflows. Industry reports claim that CRAs can manage 80% of PRs in open-source repositories without human involvement. As a result, understanding the effectiveness of CRA reviews is crucial for maintaining development workflows and preventing wasted effort on abandoned pull requests. However, empirical evidence on how CRA feedback quality affects PR outcomes remains limited. The goal of this paper is to help researchers and practitioners understand when and how CRAs influence PR merge success by empirically analyzing reviewer composition and the signal quality of CRA-generated comments. From AIDev's 19,450 PRs, we analyze 3,109 unique PRs in the commented review state, comparing human-only versus CRA-only reviews. We examine 98 closed CRA-only PRs to assess whether low signal-to-noise ratios contribute to abandonment. CRA-only PRs achieve a 45.20% merge rate, 23.17 percentage points lower than human-only PRs (68.37%), with significantly higher abandonment. Our signal-to-noise analysis reveals that 60.2% of closed CRA-only PRs fall into the 0-30% signal range, and 12 of 13 CRAs exhibit average signal ratios below 60%, indicating substantial noise in automated review feedback. These findings suggest that CRAs without human oversight often generate low-signal feedback associated with higher abandonment. For practitioners, our results indicate that CRAs should augment rather than replace human reviewers and that human involvement remains critical for effective and actionable code review.
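The signal-range statistic can be illustrated with a small sketch. The abstract does not say how signal is measured, so this assumes each review comment has already been labeled as signal (`True`) or noise (`False`), and that a PR's signal ratio is the fraction of its comments labeled as signal. The function names and the sample labels are hypothetical, not data from the study.

```python
def signal_ratio(labels):
    """Fraction of a PR's review comments labeled as signal (assumed definition)."""
    if not labels:
        return 0.0
    return sum(labels) / len(labels)

def share_in_low_signal_range(prs, upper=0.30):
    """Share of PRs whose signal ratio falls at or below `upper`."""
    ratios = [signal_ratio(labels) for labels in prs]
    return sum(r <= upper for r in ratios) / len(ratios)

# Hypothetical labels for three PRs: True = signal, False = noise.
sample_prs = [
    [True, False, False, False],  # ratio 0.25
    [True, True],                 # ratio 1.0
    [False, False, False],        # ratio 0.0
]
low_share = share_in_low_signal_range(sample_prs)
print(f"{low_share:.1%} of sample PRs are in the 0-30% signal range")
```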

15.4 SE May 12
Breaking the Dependency Chaos: A Constraint-Driven Python Dependency Resolution Strategy with Selective LLM Imputation

Kowshik Chowdhury, Dipayan Banik, Shazibul Islam Shamim

Dependency resolution is the task of selecting package versions that can be installed together without conflicts. It accounts for a significant share of build failures in modern software projects. In the Python ecosystem, this task is especially challenging due to Python 2/3 incompatibilities, deprecated packages, and widespread missing metadata. Recent work, such as PLLM, tackles this problem by using large language models (LLMs) to infer Python and package versions from code and iteratively repairing them based on build errors. We present SMT-LLM, a hybrid system that replaces LLM-only version guessing with formal constraint solving. SMT-LLM uses deterministic import extraction and Python version detection via abstract syntax tree (AST) analysis, the vermin tool to infer minimum Python versions, and a five-tier import-to-package resolver that queries PyPI before any LLM call. We construct a constraint graph from PyPI metadata and LLM-imputed dependencies for packages with missing metadata, then solve for consistent version assignments using a Z3 satisfiability modulo theories (SMT) solver. On the HG2.9K benchmark using Gemma2:9B (10 GB VRAM), SMT-LLM resolves 83.6% of snippets compared to PLLM's 54.8%, while reducing median resolution time from 151.5 s to 23.9 s (6.3x faster) and average LLM calls from ~24.9 to 2.26 per snippet (11x reduction).
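Two of the steps described above can be sketched with the standard library alone: deterministic import extraction via `ast`, and a search for a consistent version assignment. This is an illustration under stated assumptions, not the SMT-LLM implementation. In particular, the solver below is plain enumeration standing in for the Z3 SMT step, and the package names, versions, and constraints are invented for the example.

```python
import ast
from itertools import product

def extract_imports(source):
    """Return the sorted top-level module names a snippet imports (absolute imports only)."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            modules.add(node.module.split(".")[0])
    return sorted(modules)

def solve(candidates, constraints):
    """Brute-force stand-in for an SMT solver.

    candidates: {package: [versions, most preferred first]}
    constraints: predicates over a {package: version} assignment
    Returns the first assignment satisfying every constraint, or None.
    """
    names = sorted(candidates)
    for combo in product(*(candidates[n] for n in names)):
        assignment = dict(zip(names, combo))
        if all(check(assignment) for check in constraints):
            return assignment
    return None

snippet = (
    "import os.path\n"
    "import numpy as np\n"
    "from sklearn.model_selection import train_test_split\n"
    "from . import local\n"
)
imports = extract_imports(snippet)

# Hypothetical packages, versions, and constraints (not real PyPI metadata).
candidates = {"pkg_a": [(1, 24), (1, 16)], "pkg_b": [(2, 0), (0, 24)]}
constraints = [
    # pkg_b 2.0 and later requires pkg_a 1.20 or later.
    lambda a: a["pkg_b"] < (2, 0) or a["pkg_a"] >= (1, 20),
    # The snippet relies on an API that pkg_a dropped in 1.20.
    lambda a: a["pkg_a"] < (1, 20),
]
solution = solve(candidates, constraints)
print(imports, solution)
```

Enumeration scales poorly as packages and versions grow, which is one reason a dedicated solver such as Z3 is a natural fit for the constraint-solving step the abstract describes.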