Md Nakhla Rafi

3 Papers

24.0 · SE · Jul 8

Bug Report Specification Refinement with Trajectory Guidance for Automated Program Repair

S M Farah Al Fahim, Md Nakhla Rafi, Md Ahasanuzzaman et al.

Bug reports serve as task specifications for repository-level automated program repair (APR) agents, but they often describe only the observed failure and omit repair-relevant information such as the failure-inducing behavior, behavioral requirement, and implementation scope. As a result, a repair agent may inspect irrelevant code, infer an incorrect requirement, or generate a patch that addresses the reported symptom without restoring the intended repository behavior. We present TrajSpec, a trajectory-guided approach for repository-supported bug report specification refinement. Given an original report and a pre-fix repository, TrajSpec runs a trajectory-collection agent and uses the resulting unverified trajectory as a source of trajectory-derived specification evidence. It organizes this evidence into a three-level representation consisting of a high-level interpretation of the issue, diagnostic findings supporting that interpretation, and concrete repository observations. TrajSpec then generates a draft refined report and applies repository-based review to remove unsupported claims, revise uncertain claims, and add repository-supported details. We evaluate TrajSpec on all 300 SWE-Bench Lite instances using Mini-SWE-Agent V2. TrajSpec's refined reports improve Pass@1 from 41.00% to 59.67% with GPT-5-mini and from 54.67% to 64.33% with MiniMax M2.5. On a stratified sample of 100 instances, TrajSpec's refined reports also improve Pass@1 from 41.00% to 71.00% with Agentless and from 47.00% to 72.00% with AutoCodeRover. Ablation results show that removing repository-based review or the hierarchical evidence representation reduces Pass@1 from 59.67% to 48.00% and 47.67%, respectively. Overall, TrajSpec provides actionable repository-supported context that consistently improves repair performance.
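The three-level evidence representation and the repository-based review step described above can be pictured with a small data model. This is only an illustrative sketch, not TrajSpec's implementation: the class names (`Observation`, `Finding`, `Evidence`), the `Support` verdicts, and the `review` function are all hypothetical, and the verdicts are assumed to be assigned by some earlier repository check that is not shown.

```python
from dataclasses import dataclass, field
from enum import Enum


class Support(Enum):
    """Hypothetical verdict for a claim after checking it against the repository."""
    SUPPORTED = "supported"
    UNCERTAIN = "uncertain"
    UNSUPPORTED = "unsupported"


@dataclass
class Observation:
    """Lowest level: a concrete repository observation."""
    path: str
    note: str


@dataclass
class Finding:
    """Middle level: a diagnostic finding backed by observations."""
    claim: str
    observations: list = field(default_factory=list)
    support: Support = Support.UNCERTAIN


@dataclass
class Evidence:
    """Top level: a high-level interpretation of the issue plus its findings."""
    interpretation: str
    findings: list = field(default_factory=list)


def review(evidence: Evidence) -> list:
    """Drop unsupported claims, hedge uncertain ones, keep supported ones."""
    lines = [evidence.interpretation]
    for finding in evidence.findings:
        if finding.support is Support.UNSUPPORTED:
            continue  # remove claims the repository does not back up
        prefix = "Possibly: " if finding.support is Support.UNCERTAIN else ""
        refs = ", ".join(obs.path for obs in finding.observations)
        suffix = f" (see {refs})" if refs else ""
        lines.append(f"{prefix}{finding.claim}{suffix}")
    return lines
```

In this sketch, a refined report is just the list of lines returned by `review`; the abstract's "add repository-supported details" step is represented only by the file references appended to each surviving claim.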

9.0 · AI · May 30

FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search

Md Nakhla Rafi, Md Ahasanuzzaman, Dong Jae Kim et al.

LLM-based agents increasingly solve complex tasks through long trajectories involving reasoning steps, tool calls, and inter-agent communication. However, when these agents fail, it is often unclear which agent caused the failure and which step introduced the decisive error. This attribution problem is challenging because mistakes can propagate across the trajectory: later actions may appear incorrect, but only because they depend on an earlier corrupted state. Therefore, failure attribution cannot be treated as independent step-level classification. We propose FALAT, a diagnostic framework for failure attribution in LLM agent trajectories. FALAT frames attribution as a dependency-guided search problem. It first constructs an expectation of how the task should be solved and uses this expectation to identify suspicious regions in the trajectory. It then traces dependencies among decisions, tool outputs, and agent messages to distinguish error-introducing steps from steps that merely inherit or propagate prior mistakes. Finally, FALAT evaluates whether correcting a candidate step would be sufficient to recover the expected outcome, allowing it to identify both the responsible agent and the decisive failure step. We evaluate FALAT on the Who&When benchmark, which includes both algorithm-generated and hand-crafted multi-agent failure trajectories. The results show that FALAT consistently improves responsible-agent and decisive-step attribution. Its best configurations achieve 46.0% step-level accuracy on algorithm-generated trajectories and 29.1% on the more challenging hand-crafted trajectories, outperforming specialized attribution baselines and direct prompting with standalone LLMs. These findings suggest that dependency-aware reasoning is essential for reliable failure diagnosis in LLM agent systems.
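The distinction between steps that introduce an error and steps that merely inherit one can be sketched as a search over a dependency graph. The code below is a simplified illustration under stated assumptions, not FALAT's actual algorithm: the function name `decisive_step`, the `deps` mapping, the precomputed `suspicious` set, and the `fix_recovers` oracle (standing in for the "would correcting this step recover the expected outcome" check) are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set


def decisive_step(
    deps: Dict[int, List[int]],
    suspicious: Set[int],
    fix_recovers: Callable[[int], bool],
) -> Optional[int]:
    """Return the earliest suspicious step that introduces an error.

    deps maps a step index to the earlier steps it depends on.
    A suspicious step is treated as error-introducing only if none of its
    transitive dependencies is suspicious; otherwise it inherits the error.
    fix_recovers is a hypothetical oracle for the sufficiency check.
    """
    def inherits(step: int) -> bool:
        # Depth-first walk over everything this step depends on.
        stack, seen = list(deps.get(step, [])), set()
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            if dep in suspicious:
                return True
            stack.extend(deps.get(dep, []))
        return False

    for step in sorted(suspicious):
        if not inherits(step) and fix_recovers(step):
            return step
    return None
```

For a linear trajectory where step 3 depends on step 2 and both look wrong, this sketch blames step 2, since step 3 only propagates the earlier mistake.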

7.8 · SE · Apr 29

CI-Repair-Bench: A Repository-Aware Benchmark for Automated Patch Validation via CI Workflows

Rabeya Khatun Muna, Md Nakhla Rafi, Tse-Hsun et al.

Continuous Integration (CI) enforces repository-level correctness through multi-stage workflows and is central to modern software development, yet diagnosing and repairing CI failures remains challenging. Unlike traditional program repair, CI failures frequently involve non-code artifacts, environment and dependency issues, noisy execution logs, and workflow-level constraints. Existing program repair benchmarks fall short in this setting: they are largely test-centric, restrict repairs to source code, assume fixed execution environments, and evaluate under simplified CI workflows that do not reflect real repository-level validation. We introduce CI-Repair-Bench, a benchmark for CI-verified, repository-level program repair constructed from real GitHub Actions executions. It contains 567 CI failure instances from 103 repositories and evaluates repair correctness exclusively through full CI re-execution under original workflows. Failures are categorized into 12 CI error types, enabling fine-grained, error-type-aware evaluation. To demonstrate benchmark usage, we include a reference CI repair workflow that analyzes CI logs to localize faults and generate candidate patches. Empirical results show that automated repair is most effective for localized, tool-enforced failures such as formatting and linting, while environment, dependency, and configuration-related failures remain challenging; the best-performing LLM achieves an 18.9% repair success rate. CI-Repair-Bench provides a realistic evaluation foundation for advancing research on CI-native automated program repair.
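Error-type-aware evaluation of the kind described above amounts to grouping repair outcomes by CI error type and computing a success rate per group. The following is a minimal sketch, assuming each result is a hypothetical `(error_type, ci_passed)` pair where `ci_passed` records whether the full CI re-run succeeded; the function name and the input format are illustrative and are not taken from the benchmark.

```python
from collections import defaultdict


def success_by_error_type(results):
    """Compute the repair success rate for each CI error type.

    results: iterable of (error_type, ci_passed) pairs (assumed format).
    Returns {error_type: fraction of instances whose CI re-run passed}.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for error_type, ci_passed in results:
        totals[error_type] += 1
        passed[error_type] += int(ci_passed)
    return {etype: passed[etype] / totals[etype] for etype in totals}
```

A breakdown like this makes it visible when an overall success rate is carried by a few easy categories, which is the pattern the abstract reports for formatting and linting failures.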