9.1SEMay 20
A Dataset of Reproducible Flaky-Test FailuresSuzzana Rafi, Mahbub-Ul-Hoque Sumon, Md Erfan et al.
Flaky tests pass and fail non-deterministically when run on the same version of code. Although many techniques have been proposed to detect, debug, and repair flaky tests, reproducing their failures remains a major challenge due to their inherent nondeterminism. Many flaky test datasets exist to help researchers study them, but these datasets are often composed of disjoint sets of flaky tests, where each dataset provides unique information, such as flaky tests of different categories, failure logs of flaky tests, or flaky tests reported by developers vs. flaky tests found by automated tools. In this work, we aim to create a reproducible dataset of flaky tests, curated from both developer issue reports and a popular dataset of flaky tests. Compared to prior flaky test datasets, our dataset is the first to provide (1) a reproducible environment to compile flaky tests, (2) scripts to reproduce failures, (3) scripts to automatically apply flaky test fixes and ensure that the tests are no longer flaky, and (4) execution logs of flaky test passing and failing. We present ReproFlake, a dataset of 1115 reproducible flaky tests across four flaky test categories. We create guidelines to help others contribute to this reproducible dataset, and demonstrate how to use our dataset to understand challenges in reproducing flaky test failures (e.g., challenges researchers may face when using prior flaky test datasets), the characteristics (e.g., location of the fix and its correlation with the flaky test category), and difficulties researchers may face in using our dataset to collect additional information (e.g., code coverage) about flaky tests. Our findings show that error information helps identify flaky test categories and guide repairs, that unresolved compilation failures highlight challenges in building legacy projects, and knowing typical fix locations can help prioritize repair efforts.
SEMar 6
Understanding and Finding JIT Compiler Performance BugsZijian Yi, Cheng Ding, August Shi et al.
Just-in-time (JIT) compilers are key components for many popular programming languages with managed runtimes (e.g., Java and JavaScript). JIT compilers perform optimizations and generate native code at runtime based on dynamic profiling data, to improve the execution performance of the running application. Like other software systems, JIT compilers might have software bugs, and prior work has developed a number of automated techniques for detecting functional bugs (i.e., generated native code does not semantically match that of the original code). However, no prior work has targeted JIT compiler performance bugs, which can cause significant performance degradation while an application is running. These performance bugs are challenging to detect due to the complexity and dynamic nature of JIT compilers. In this paper, we present the first work on demystifying JIT performance bugs. First, we perform an empirical study across four popular JIT compilers for Java and JavaScript. Our manual analysis of 191 bug reports uncovers common triggers of performance bugs, patterns in which these bugs manifest, and their root causes. Second, informed by these insights, we propose layered differential performance testing, a lightweight technique to automatically detect JIT compiler performance bugs, and implement it in a tool called Jittery. We incorporate practical optimizations into Jittery such as test prioritization, which reduces testing time by 92.40% without compromising bug-detection capability, and automatic filtering of false-positives and duplicates, which substantially reduces manual inspection effort. Using Jittery, we discovered 12 previously unknown performance bugs in the Oracle HotSpot and Graal JIT compilers, with 11 confirmed and 6 fixed by developers.
SENov 18, 2025
FlakyGuard: Automatically Fixing Flaky Tests at Industry ScaleChengpeng Li, Farnaz Behrang, August Shi et al.
Flaky tests that non-deterministically pass or fail waste developer time and slow release cycles. While large language models (LLMs) show promise for automatically repairing flaky tests, existing approaches like FlakyDoctor fail in industrial settings due to the context problem: providing either too little context (missing critical production code) or too much context (overwhelming the LLM with irrelevant information). We present FlakyGuard, which addresses this problem by treating code as a graph structure and using selective graph exploration to find only the most relevant context. Evaluation on real-world flaky tests from industrial repositories shows that FlakyGuard repairs 47.6 % of reproducible flaky tests with 51.8 % of the fixes accepted by developers. Besides it outperforms state-of-the-art approaches by at least 22 % in repair success rate. Developer surveys confirm that 100 % find FlakyGuard's root cause explanations useful.