Linhao Wu

SE
4papers
1citation
Novelty53%
AI Score46

4 Papers

71.3SEMar 25
Fixturize: Bridging the Fixture Gap in Test Generation

Chengyi Wang, Pengyu Xue, Zhen Yang et al.

Current Large Language Models (LLMs) have advanced automated unit test generation but face a critical limitation: they often neglect to construct the necessary test fixtures, which are the environmental setups required for a test to run. To bridge this gap, this paper proposes Fixturize, a diagnostic framework that proactively identifies fixture-dependent functions and synthesizes test fixtures accordingly through an iterative, feedback-driven process, thereby improving the quality of auto-generated test suites of existing approaches. For rigorous evaluation, the authors introduce FixtureEval, a dedicated benchmark comprising 600 curated functions across two Programming Languages (PLs), i.e., Python and Java, with explicit fixture dependency labels, enabling both the corresponding classification and generation tasks. Empirical results demonstrate that Fixturize is highly effective, achieving 88.38%-97.00% accuracy across benchmarks in identifying the dependence of test fixtures and significantly enhancing the Suite Pass rate (SuitePS) by 18.03%-42.86% on average across both PLs with the auto-generated fixtures. Owing to the maintenance of test fixtures, Fixturize further improves line/branch coverage when integrated with existing testing tools of both LLM-based and Search-based by 16.85%/24.08% and 31.54%/119.66% on average, respectively. The findings establish fixture awareness as an essential, missing component in modern auto-testing pipelines.

67.4SEApr 4
Toward Executable Repository-Level Code Generation via Environment Alignment

Ruwei Pan, Junlei Shen, Linhao Wu et al.

Large language models (LLMs) have achieved strong performance on code generation, but existing methods still struggle with repository-level code generation under executable validation. Under this evaluation setting, success is determined not by the plausibility of isolated code fragments, but by whether a generated multi-file repository can be successfully installed, have its dependencies and internal references resolved, be launched, and be validated in a real execution environment. To address this challenge, we propose EnvGraph, a framework for repository-level code generation that formulates repository executability as an environment alignment problem. EnvGraph jointly models two coupled conditions for successful repository execution, namely external dependency satisfaction and repository-internal reference resolution. It maintains a dual-layer environment representation, uses execution evidence to perform execution-evidence-based attribution, and guides repository generation through a unified targeted revision mechanism within an iterative alignment loop. We evaluate EnvGraph on repository-level code generation with three representative backbone LLMs and compare it against representative environment-aware and repository-level baselines. Experimental results show that EnvGraph consistently achieves the best performance on these repository-level benchmarks. In particular, it outperforms the strongest non-EnvGraph baseline by an absolute margin of 5.72--5.87 percentage points in Functional Correctness and 4.58--8.66 percentage points in Non-Functional Quality.

76.1SEApr 4
Persistent Cross-Attempt State Optimization for Repository-Level Code Generation

Ruwei Pan, Jiangshuai Wang, Qisheng Zhang et al.

Large language models (LLMs) have achieved substantial progress in repository-level code generation. However, solving the same repository-level task often requires multiple attempts, while existing methods still optimize each attempt in isolation and do not preserve or reuse task-specific state across attempts. In this paper, we propose LiveCoder, a novel framework for repository-level code generation based on cross-attempt knowledge optimization. LiveCoder maintains persistent task-specific state from prior attempts to guide subsequent generation. This state includes success knowledge, which captures reusable signals from previously strong repositories, failure knowledge, which records unsuccessful outcomes and their diagnostic signals, and a historical-best repository, which preserves the strongest result found so far and prevents regression. These components collectively transform repeated repository generation into a persistent, knowledge-driven optimization process. We evaluate LiveCoder using four frontier LLMs on two representative repository-level code generation benchmarks. Extensive experimental results demonstrate the effectiveness and efficiency of LiveCoder, improving the functional score by up to 22.94 percentage points, increasing repository reuse to 81.58%, and reducing cost by up to 53.63% on RAL-Bench while maintaining broadly stable non-functional quality.

64.4SEApr 21
DebugRepair: Enhancing LLM-Based Automated Program Repair via Self-Directed Debugging

Linhao Wu, Yifei Pei, Zhen Yang et al.

Automated Program Repair (APR) has benefited from the code understanding and generation capabilities of Large Language Models (LLMs). Existing feedback-based APR methods iteratively refine candidate patches using test execution feedback and have shown promising results. However, most rely on outcome-level failure symptoms, such as stack traces, which show how failures are observed but fail to expose the intermediate runtime states critical for root-cause analysis. As a result, LLMs often infer bug causes without sufficient runtime evidence, leading to incorrect patches. To address this limitation, we propose DebugRepair, a self-directed debugging framework for LLM-based APR. DebugRepair enhances patch refinement with intermediate runtime evidence collected through simulated debugging. It consists of three components: test semantic purification, simulated instrumentation, and debugging-driven conversational repair. Together, they reduce noisy test context, collect runtime traces through targeted debugging statements with rule-based fallback, and progressively refine candidate patches using prior attempts and newly observed runtime states. We evaluate DebugRepair on three benchmarks across Java and Python. Experiments show that DebugRepair achieves state-of-the-art performance against 15 approaches. With GPT-3.5, it correctly fixes 224 bugs on Defects4J, outperforming prior SOTA LLM-based methods by 26.2%. With DeepSeek-V3, it correctly fixes 295 Defects4J bugs, surpassing the second-best baseline by 59 bugs. Across five additional backbone LLMs, DebugRepair improves repair performance by 51.3% over vanilla settings. Ablation studies further confirm the effectiveness of all components.