SE, Feb 5, 2025
COSMosFL: Ensemble of Small Language Models for Fault Localisation
Hyunjoon Cho, Sungmin Kang, Gabin An et al.
LLMs are rapidly being adopted to build powerful tools and agents for software engineering, but most of them rely heavily on extremely large closed-source models. This, in turn, can hinder wider adoption due to security issues as well as financial cost and environmental impact. Recently, a number of open-source Small Language Models (SLMs) have been released and are gaining traction. While SLMs are smaller, more energy-efficient, and therefore easier to deploy locally, they tend to perform worse than larger closed LLMs. We present COSMos, a task-level LLM ensemble technique that uses a voting mechanism to provide a broader range of choices between SLMs and LLMs. We instantiate COSMos with an LLM-based Fault Localisation technique, AutoFL, and report the cost-benefit trade-off between LLM accuracy and various costs such as energy consumption, inference time, and the number of tokens used. An empirical evaluation using Defects4J shows that COSMos can build effective ensembles that achieve Pareto-optimality in terms of FL accuracy and inference cost when compared to individual models.
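The abstract does not spell out the voting mechanism, so the following is only a minimal sketch of one plausible form: a weighted average of per-model suspiciousness scores. The function name `ensemble_vote`, the model names, the score dictionaries, and the weighting scheme are all assumptions made for illustration, not the COSMos implementation.

```python
from collections import defaultdict


def ensemble_vote(model_scores, weights=None):
    """Combine per-model suspiciousness scores by weighted voting.

    model_scores: dict mapping model name -> {code element: score}
    weights: optional dict mapping model name -> non-negative weight
             (equal weights if omitted). Both shapes are hypothetical.
    Returns a list of (code element, combined score), best first.
    """
    if weights is None:
        weights = {name: 1.0 for name in model_scores}
    total = sum(weights[name] for name in model_scores)
    combined = defaultdict(float)
    for name, scores in model_scores.items():
        for element, score in scores.items():
            # Each model contributes its score scaled by its normalised weight.
            combined[element] += weights[name] * score / total
    # Sort by combined score (descending); break ties by name for determinism.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical outputs of three small models for one failing test.
votes = {
    "slm_a": {"Foo.parse": 0.6, "Foo.init": 0.4},
    "slm_b": {"Foo.parse": 0.2, "Bar.run": 0.8},
    "slm_c": {"Foo.parse": 0.7, "Bar.run": 0.3},
}
for element, score in ensemble_vote(votes):
    print(f"{element}: {score:.3f}")
```

Under this sketch, changing `weights` is what would trade accuracy against cost: giving a cheaper model more or less say shifts where the ensemble sits between the individual models.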
SE, Aug 10, 2021
Searching for Multi-Fault Programs in Defects4J
Gabin An, Juyeon Yoon, Shin Yoo
Defects4J has enabled numerous software testing and debugging research work since its introduction. A large part of its contribution, and the resulting popularity, lies in the clear separation and distillation of the root cause of each individual test failure based on careful manual analysis, which in turn allowed researchers to easily study individual faults in isolation. However, in a realistic debugging scenario, multiple faults can coexist and affect test results collectively. Study of automated debugging techniques for these situations, such as failure clustering or fault localisation for multiple faults, would significantly benefit from a reliable benchmark of multiple, coexisting faults. We search for versions of Defects4J subjects that contain multiple faults, by iteratively transplanting fault-revealing test cases across Defects4J versions. Out of 326 studied versions of Defects4J subjects, we report that over 95% (311 versions) actually contain from two to 24 faults. We hope that the extended, multi-fault Defects4J can provide a platform for future research of testing and debugging techniques for multi-fault programs.
SE, Apr 21, 2021
Improving Test Distance for Failure Clustering with Hypergraph Modelling
Gabin An, Juyeon Yoon, Joyce Jiyoung Whang et al.
Automated debugging techniques, such as Fault Localisation (FL) or Automated Program Repair (APR), are typically designed under the Single Fault Assumption (SFA). However, in practice, an unknown number of faults can independently cause multiple test case failures, making it difficult to allocate resources for debugging and to use automated debugging techniques. Clustering algorithms have been applied to group test failures according to their root causes, but their accuracy is often limited by the inherent limits of the distance metrics for test cases. We introduce a new test distance metric based on hypergraphs and evaluate its accuracy using multi-fault benchmarks that we have built on top of Defects4J and SIR. Results show that our technique, Hybiscus, can automatically achieve perfect clustering (i.e., the same number of clusters as the ground truth number of root causes, with all failing tests with the same root cause grouped together) for 418 out of 605 test runs with multiple test failures. Better failure clustering also allows us to separate different root causes and apply FL techniques under SFA, saving up to 82% of the total wasted effort when compared to the state-of-the-art technique for multiple fault localisation.
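The definition of perfect clustering given in the abstract can be checked mechanically. The sketch below is an illustration under the assumption that each failing test has exactly one root cause; the function name `is_perfect_clustering` and the dictionary-based inputs are hypothetical and not taken from Hybiscus.

```python
def is_perfect_clustering(predicted, ground_truth):
    """Check the 'perfect clustering' condition for one test run.

    predicted: dict mapping failing test -> predicted cluster id
    ground_truth: dict mapping failing test -> root cause label
    Assumes every failing test has exactly one root cause.
    """
    if predicted.keys() != ground_truth.keys():
        raise ValueError("both mappings must cover the same failing tests")

    # Condition 1: same number of clusters as ground-truth root causes.
    if len(set(predicted.values())) != len(set(ground_truth.values())):
        return False

    # Condition 2: all failing tests sharing a root cause share a cluster.
    clusters_per_cause = {}
    for test, cause in ground_truth.items():
        clusters_per_cause.setdefault(cause, set()).add(predicted[test])
    return all(len(clusters) == 1 for clusters in clusters_per_cause.values())


truth = {"t1": "fault_A", "t2": "fault_A", "t3": "fault_B"}
print(is_perfect_clustering({"t1": 0, "t2": 0, "t3": 1}, truth))  # True
print(is_perfect_clustering({"t1": 0, "t2": 1, "t3": 1}, truth))  # False
```

Together the two conditions force a one-to-one match between clusters and root causes, which is what lets a single-fault FL technique be applied to each cluster separately.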
SE, Apr 14, 2021
FDG: A Precise Measurement of Fault Diagnosability Gain of Test Cases
Gabin An, Shin Yoo
The performance of many Fault Localisation (FL) techniques directly depends on the quality of the used test suites. Consequently, it is extremely useful to be able to precisely measure how much diagnostic power each test case can introduce when added to a test suite used for FL. Such a measure can help us not only to prioritise and select test cases to be used for FL, but also to effectively augment test suites that are too weak to be used with FL techniques. We propose FDG, a new measure of Fault Diagnosability Gain for individual test cases. The design of FDG is based on our analysis of existing metrics that are designed to prioritise test cases for better FL. Unlike other metrics, FDG exploits the ongoing FL results to emphasise the parts of the program for which more information is needed. Our evaluation of FDG with Defects4J shows that it can successfully help the augmentation of test suites for better FL. When given only a few failing test cases (2.3 test cases on average), FDG can effectively augment the given test suite by prioritising the test cases generated automatically by EvoSuite: the augmentation can improve the acc@1 and acc@10 of the FL results by 11.6x and 2.2x on average, after requiring only ten human judgements on the correctness of the assertions EvoSuite generates.
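For readers unfamiliar with the acc@n metric used above, the sketch below shows one common way to compute it: count the faults for which at least one faulty element appears within the top n positions of the ranking. Tie-breaking conventions differ between studies and are ignored here; the function name `acc_at_n` and the example data are hypothetical.

```python
def acc_at_n(rankings, faulty_elements, n):
    """Count faults localised within the top n ranks.

    rankings: dict mapping fault id -> list of code elements, most suspicious first
    faulty_elements: dict mapping fault id -> set of truly faulty elements
    Ties in the ranking are assumed to be already resolved.
    """
    hits = 0
    for fault_id, ranked in rankings.items():
        if any(element in faulty_elements[fault_id] for element in ranked[:n]):
            hits += 1
    return hits


# Hypothetical rankings for three faults.
rankings = {
    "bug-1": ["A.f", "A.g", "B.h"],
    "bug-2": ["C.x", "C.y", "C.z"],
    "bug-3": ["D.p", "D.q", "D.r"],
}
faulty = {"bug-1": {"A.f"}, "bug-2": {"C.z"}, "bug-3": {"E.s"}}

print(acc_at_n(rankings, faulty, 1))  # 1
print(acc_at_n(rankings, faulty, 3))  # 2
```

An improvement such as "11.6x in acc@1" is then the ratio between this count after test suite augmentation and the count before it.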
NE, Jun 27, 2019
The State and Future of Genetic Improvement
William B. Langdon, Westley Weimer, Christopher Timperley et al.
We report the discussion session at the sixth international Genetic Improvement workshop, GI-2019 @ ICSE, which was held as part of the 41st ACM/IEEE International Conference on Software Engineering on Tuesday 28th May 2019. Topics included GI representations, the maintainability of evolved code, automated software testing, future areas of GI research, such as co-evolution, and existing GI tools and benchmarks.
SE, Feb 26, 2019
Ahead of Time Mutation Based Fault Localisation using Statistical Inference
Jinhan Kim, Gabin An, Robert Feldt et al.
Mutation analysis can effectively capture the dependency between source code and test results. This has been exploited by Mutation Based Fault Localisation (MBFL) techniques. However, MBFL techniques must pay the high cost of mutation analysis after a failure is observed, which may hinder their practical adoption. We introduce SIMFL (Statistical Inference for Mutation-based Fault Localisation), an MBFL technique that allows users to perform the mutation analysis in advance, before a failure is observed, so that the analysis cost can be amortised. SIMFL uses mutants as artificial faults and aims to learn the failure patterns among test cases against different locations of mutations. Once a failure is observed, SIMFL requires either almost no or very small additional cost for analysis, depending on the inference model used. An empirical evaluation using Defects4J shows that SIMFL can successfully localise up to 113 out of 203 studied faults (55%) at the top, and 159 (78%) faults within the top five, significantly outperforming existing MBFL techniques while using the results of mutation analysis that has been undertaken before the test failure. The amortised cost of mutation analysis can be further reduced by mutation sampling: SIMFL retains 80% of its localisation accuracy at the top rank when using only 10% of generated mutants, compared to results obtained without sampling.
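The ahead-of-time idea can be illustrated with a deliberately simplified inference model: index mutants by the exact set of tests that kill them, then, once a failure is observed, rank locations by how many indexed mutants reproduce the same failing set. This exact-matching scheme, the function names `build_index` and `localise`, and the kill-matrix format are assumptions chosen for illustration; the abstract does not specify SIMFL's actual inference models.

```python
from collections import Counter


def build_index(kill_matrix):
    """Ahead of time: group mutants by the exact set of tests that kill them.

    kill_matrix: iterable of (mutated location, iterable of killing test names)
    Returns a dict: frozenset of tests -> Counter of locations.
    """
    index = {}
    for location, killing_tests in kill_matrix:
        key = frozenset(killing_tests)
        if key:  # mutants that no test kills carry no failure pattern
            index.setdefault(key, Counter())[location] += 1
    return index


def localise(index, failing_tests):
    """After a failure: rank locations whose mutants fail exactly these tests.

    The score is the share of matching mutants at each location, i.e. a
    posterior over locations under a uniform prior over mutants.
    """
    counts = index.get(frozenset(failing_tests), Counter())
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(location, count / total) for location, count in ranked]


# Hypothetical mutation analysis results, computed before any real failure.
kill_matrix = [
    ("Calc.add", ["t1", "t2"]),
    ("Calc.add", ["t1", "t2"]),
    ("Calc.sub", ["t1", "t2"]),
    ("Calc.sub", ["t3"]),
    ("Calc.mul", []),
]
index = build_index(kill_matrix)
print(localise(index, ["t2", "t1"]))
```

In this sketch the expensive step is `build_index`, which runs once in advance, while `localise` is a dictionary lookup. Mutation sampling would correspond to passing only a random subset of `kill_matrix` to `build_index`.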