75.0SEApr 28Code
LLM-Guided Issue Generation from Uncovered Code SegmentsDiany Pressato, Honghao Tan, Mariam Elmoazen et al.
Developers are increasingly overwhelmed by AI-generated issue reports that lack actionability and reproducibility, eroding trust in automated bug detection tools. In this paper, we present IssueSpecter, an automated tool that finds bugs in uncovered code segments and automatically generates prioritized, actionable issue reports. IssueSpecter combines coverage analysis with LLM-based defect identification, producing structured reports complete with severity ratings, reproduction steps, and suggested fixes. We evaluate IssueSpecter on 13 actively maintained Python projects, generating 10,467 issue reports. Manual annotation of the top-130 ranked issues by IssueSpecter confirms that 84.6% of the LLM-generated issues are valid or warrant further investigation, with only 15.4% false positives. LLM-based ranking outperforms rule-based ranking by 50% at P@3 and 41% in MRR. The identified bugs cover a wide variety of types, from logic and boundary errors to security vulnerabilities and state consistency bugs. By ranking issues by priority, IssueSpecter aims to help developers focus their attention on the most impactful bugs first. Finally, we validate IssueSpecter through case studies reproducing real bugs surfaced from its generated issue reports, demonstrating its practical value for automatic bug discovery in open-source Python projects. Compared against CoverUp, a state-of-the-art coverage-driven test generation tool, IssueSpecter achieves a higher bug validity rate (81.0% vs. 76.2%) under identical evaluation conditions, using the same model and the same number of evaluated artifacts per project, while additionally providing structured issue reports with reproduction steps and candidate fixes that are immediately actionable without requiring developers to interpret generated test intent.
60.4SEApr 13
Structured Safety Auditing for Balancing Code Correctness and Content Safety in LLM-Generated CodeHonghao Tan, Haibo Wang, Shin Hwei Tan
Large language models (LLMs) for code generation are typically evaluated on functional correctness alone, overlooking whether generated code propagates harmful content embedded in the prompt. Prior work has shown that most Code LLMs reproduce offensive identifiers from injected renaming instructions without warning, yet existing approaches focus on detecting harmful content, neglecting functional correctness. Grounded in the Theory of Dual Channel Constraints (which states that code is a dual-channel medium combining an algorithmic (AL) channel for machine execution and a natural language (NL) channel for human communication, creating a unique safety-utility trade-off where a model must balance functional execution with responsible communication), we propose NLSafety-Utility Duality Score (SUDS), a metric that unifies code utility, safety adherence, and warning awareness into a single score across 12 ranked response scenarios, and Dual Reasoning (DR), a structured inference-time technique that requires an explicit safety audit and task-grounded code review before code generation. Evaluated on five LLMs across two benchmarks augmented with harmful keyword injections (820 and 2,135 samples), DR consistently achieves the highest SUDS across all models, improving mean SUDS by 1.32$\times$ to 3.42$\times$ over the baseline, while chain-of-thought prompting yields negligible safety gains and a safety-aware prompt provides only partial improvement. Further analysis reveals that DR's effectiveness scales with model capacity, that the one-shot exemplar primarily stabilizes output format for smaller models, and that structured reasoning cannot compensate for models with limited safety vocabularies.