Soneya Binta Hossain

SE
3papers
Novelty38%
AI Score46

3 Papers

39.4SEMay 21Code
PITMuS: A Tool for Automated Bug Dataset Generation via Source-Level Mutant Reconstruction

Tasfia Tasnim, Soneya Binta Hossain

LLM-based software engineering increasingly depends on executable, context-rich bug artifacts: paired correct and buggy code, methods under test (MUTs), documentation, and metadata. These artifacts support the training and evaluation of automated bug localization and repair techniques, testing and test oracle generation methods, and documentation-driven automation. Although curated benchmarks (e.g., Defects4J) remain valuable, they are static and increasingly vulnerable to contamination as code models are trained on large public corpora. A complementary strategy is to generate fresh, cutoff-aware datasets by selecting real system versions and injecting controlled bugs at the source level. Mutation testing is a natural basis for this strategy: it applies predefined mutation operators to programs and records whether the existing test suite detects each injected change. PIT is a state-of-the-practice mutation testing tool for Java that performs mutation at the bytecode level. This design makes mutation testing fast and practical, but PITMuS reports mutants primarily through XML, making them difficult to inspect, replay, or reuse as structured source-level dataset records. To address this gap, we present PITMuS, which combines PITMuS XML metadata with debug information from compiled Java class files to localize and reconstruct the source edit corresponding to each mutant. PITMuS then automatically produces structured datasets containing source-level buggy and fixed code pairs, documentation context, and metadata for downstream training and evaluation. Although we evaluate PITMuS on eight open-source Java systems, it can be applied to any Java system where PITMuS can be integrated.

44.6SEApr 28Code
From Threads to Trajectories: A Multi-LLM Pipeline for Community Knowledge Extraction from GitHub Issue Discussions

Nazia Shehnaz Joynab, Soneya Binta Hossain

Resolution of complex post-production issues in large-scale open-source software (OSS) projects requires significant cognitive effort, as developers need to go through long, unstructured and fragmented issue discussion threads before that. In this paper, we present SWE-MIMIC-Bench, an issue trajectory dataset generated from raw GitHub discussions using an automated multi-LLM pipeline. Unlike simple summarization, this pipeline utilizes a group of closed-source LLMs to perform granular tasks: analyzing individual comments with awareness of externally-linked resources, classifying comment analyses into label-specific fields (e.g., root cause, solution plan, implementation progress), and synthesizing label-aware trajectories which capture a structured and coherent narrative of the entire discussion thread. Our pipeline uses five closed-source LLM configurations for distinct purposes: label classification, inline code block and external link summarization, comment analysis, label-specific field classification and trajectory synthesis. By generating concise and reliable trajectories from complex conversation threads, this system can assist developers and researchers of broader software engineering community to understand the experience-driven collaborative approach for issue diagnosis. Furthermore, the generated trajectories can be used to train modern LLM agents to think and act like an expert developer. We evaluated our system on 800 real-world GitHub issues drawn from the SWE-Bench-Pro, SWE-Bench-Multilingual and SWE-Bench-Verified dataset, achieving a 91.7% success rate in extracting 734 high-fidelity reasoning trajectories.

36.8SEApr 3
Measuring LLM Trust Allocation Across Conflicting Software Artifacts

Noshin Ulfat, Ahsanul Ameen Sabit, Soneya Binta Hossain

LLM-based software engineering assistants fail not only by producing incorrect outputs, but also by allocating trust to the wrong artifact when code, documentation, and tests disagree. Existing evaluations focus mainly on downstream outcomes and therefore cannot reveal whether a model recognized degraded evidence, identified the unreliable source, or calibrated its trust across artifacts. We present TRACE (Trust Reasoning over Artifacts for Calibrated Evaluation), a framework that elicits structured artifact-level trust traces over Javadoc, method signatures, implementations, and test prefixes under blind perturbations. Using 22,339 valid traces from seven models on 456 curated Java method bundles, we evaluate per-artifact quality assessment, inconsistency detection, affected artifact attribution, and source prioritization. Across all models, quality penalties are largely localized to the perturbed artifact and increase with severity, but sensitivity is asymmetric across artifact types: documentation bugs induce a substantially larger heavy-to-subtle gap than implementation faults (0.152-0.253 vs. 0.049-0.123). Models detect explicit documentation bugs well (67-94%) and Javadoc and implementation contradictions at 50-91%, yet show a systematic blind spot when only the implementation drifts while the documentation remains plausible, with detection dropping by 7-42 percentage points. Confidence is poorly calibrated for six of seven models. These findings suggest that current LLMs are better at auditing natural-language specifications than at detecting subtle code-level drift, motivating explicit artifact-level trust reasoning before correctness-critical downstream use.