Xavier Devroey

10papers

414citations

Novelty37%

AI Score49

Ranked #46,362 of 201,326 authors (top 23%)#479 in SE (top 14%)

10 Papers

SEMar 6

Real-World Fault Detection for C-Extended Python Projects with Automated Unit Test Generation

Lucas Berg, Lukas Krodinger, Stephan Lukasczyk et al.

Many popular Python libraries use C-extensions for performance-critical operations allowing users to combine the best of the two worlds: The simplicity and versatility of Python and the performance of C. A drawback of this approach is that exceptions raised in C can bypass Python's exception handling and cause the entire interpreter to crash. These crashes are real faults if they occur when calling a public API. While automated test generation should, in principle, detect such faults, crashes in native code can halt the test process entirely, preventing detection or reproduction of the underlying errors and inhibiting coverage of non-crashing parts of the code. To overcome this problem, we propose separating the generation and execution stages of the test-generation process. We therefore adapt Pynguin, an automated test case generation tool for Python, to use subprocess-execution. Executing each generated test in an isolated subprocess prevents a crash from halting the test generation process itself. This allows us to (1) detect such faults, (2) generate reproducible crash-revealing test cases for them, (3) allow studying the underlying faults, and (4) enable test generation for non-crashing parts of the code. To evaluate our approach, we created a dataset consisting of 1648 modules from 21 popular Python libraries with C-extensions. Subprocess-execution allowed automated testing of up to 56.5% more modules and discovered 213 unique crash causes, revealing 32 previously unknown faults.

SEMay 8Code

System Test Generation for Virtual Reality Applications using Scenario Models

Gerry Longfils, Maxime Cauz, Arnaud Blouin et al.

Virtual Reality (VR) applications are increasingly being integrated across a wide range of domains, including surgical training and industrial marketing. However, the long-term adoption and maintenance of VR applications remain limited, particularly due to the lack of effective, systematic, and reproducible software testing approaches tailored to their unique characteristics. To address this issue, we introduce UltraInstinctVR, a novel testing approach for VR applications. Relying on predefined VR models (scenarios), it automates the generation and execution of concrete VR system tests. In our empirical evaluation, we compare UltraInstinctVR with state-of-the-art automated VR testing approaches in terms of coverage and failure detection on 10 open-source VR applications. The results show that UltraInstinctVR outperforms existing automated tools for detecting unique failures and provides valuable insights for identifying real-world bugs in VR applications.

SEJan 13, 2020Code

Generating Class-Level Integration Tests Using Call Site Information

Pouria Derakhshanfar, Xavier Devroey, Annibale Panichella et al.

Search-based approaches have been used in the literature to automate the process of creating unit test cases. However, related work has shown that generated unit-tests with high code coverage could be ineffective, i.e., they may not detect all faults or kill all injected mutants. In this paper, we propose CLING, an integration-level test case generation approach that exploits how a pair of classes, the caller and the callee, interact with each other through method calls. In particular, CLING generates integration-level test cases that maximize the Coupled Branches Criterion (CBC). Coupled branches are pairs of branches containing a branch of the caller and a branch of the callee such that an integration test that exercises the former also exercises the latter. CBC is a novel integration-level coverage criterion, measuring the degree to which a test suite exercises the interactions between a caller and its callee classes. We implemented CLING and evaluated the approach on 140 pairs of classes from five different open-source Java projects. Our results show that (1) CLING generates test suites with high CBC coverage, thanks to the definition of the test suite generation as a many-objectives problem where each couple of branches is an independent objective; (2) such generated suites trigger different class interactions and can kill on average 7.7% (with a maximum of 50%) of mutants that are not detected by tests generated at the unit level; (3) CLING can detect integration faults coming from wrong assumptions about the usage of the callee class (32 for our subject systems) that remain undetected when using automatically generated unit-level test suites.

SEDec 10, 2019Code

Search-based Crash Reproduction using Behavioral Model Seeding

Pouria Derakhshanfar, Xavier Devroey, Gilles Perrouin et al.

Search-based crash reproduction approaches assist developers during debugging by generating a test case which reproduces a crash given its stack trace. One of the fundamental steps of this approach is creating objects needed to trigger the crash. One way to overcome this limitation is seeding: using information about the application during the search process. With seeding, the existing usages of classes can be used in the search process to produce realistic sequences of method calls which create the required objects. In this study, we introduce behavioral model seeding: a new seeding method which learns class usages from both the system under test and existing test cases. Learned usages are then synthesized in a behavioral model (state machine). Then, this model serves to guide the evolutionary process. To assess behavioral model-seeding, we evaluate it against test-seeding (the state-of-the-art technique for seeding realistic objects) and no-seeding (without seeding any class usage). For this evaluation, we use a benchmark of 124 hard-to-reproduce crashes stemming from six open-source projects. Our results indicate that behavioral model-seeding outperforms both test seeding and no-seeding by a minimum of 6% without any notable negative impact on efficiency.

SEOct 22, 2017Code

Test them all, is it worth it? Assessing configuration sampling on the JHipster Web development stack

Axel Halin, Alexandre Nuttinck, Mathieu Acher et al.

Many approaches for testing configurable software systems start from the same assumption: it is impossible to test all configurations. This motivated the definition of variability-aware abstractions and sampling techniques to cope with large configuration spaces. Yet, there is no theoretical barrier that prevents the exhaustive testing of all configurations by simply enumerating them, if the effort required to do so remains acceptable. Not only this: we believe there is lots to be learned by systematically and exhaustively testing a configurable system. In this case study, we report on the first ever endeavour to test all possible configurations of an industry-strength, open source configurable software system, JHipster, a popular code generator for web applications. We built a testing scaffold for the 26,000+ configurations of JHipster using a cluster of 80 machines during 4 nights for a total of 4,376 hours (182 days) CPU time. We find that 35.70% configurations fail and we identify the feature interactions that cause the errors. We show that sampling strategies (like dissimilarity and 2-wise): (1) are more effective to find faults than the 12 default configurations used in the JHipster continuous integration; (2) can be too costly and exceed the available testing budget. We cross this quantitative analysis with the qualitative assessment of JHipster's lead developers.

SEApr 21

Systematic Detection of Energy Regression and Corresponding Code Patterns in Java Projects

François Bechet, Jérôme Maquoi, Luís Cruz et al.

Green software engineering is emerging as a crucial response to information technology's rising energy impact, especially in continuous development. However, there remain challenges in devising automated methods for identifying energy regressions across commits and their associated code change patterns. In particular, little effort has been put into automatically detecting regressions at the commit level by identifying statistically significant changes in energy consumption. In this paper, we introduce EnergyTrackr, an approach designed to detect energy regressions across multiple commits that can then be used to identify code anti-patterns potentially contributing to the increase of software energy consumption over time. We describe our empirical evaluation, including repository mining and source code analysis, made on 3,232 commits from three Java projects, and show the approach's ability to identify significant energy changes. We also highlight recurring anti-patterns such as missing early exits or costly dependency upgrades. We expect EnergyTrackr to assist developers in accurately monitoring energy regressions and improvements within their projects, identifying code anti-patterns, and helping them optimize their source code to reduce software energy consumption.

SEJun 14, 2021

JUGE: An Infrastructure for Benchmarking Java Unit Test Generators

Xavier Devroey, Alessio Gambi, Juan Pablo Galeotti et al.

Researchers and practitioners have designed and implemented various automated test case generators to support effective software testing. Such generators exist for various languages (e.g., Java, C#, or Python) and for various platforms (e.g., desktop, web, or mobile applications). Such generators exhibit varying effectiveness and efficiency, depending on the testing goals they aim to satisfy (e.g., unit-testing of libraries vs. system-testing of entire applications) and the underlying techniques they implement. In this context, practitioners need to be able to compare different generators to identify the most suited one for their requirements, while researchers seek to identify future research directions. This can be achieved through the systematic execution of large-scale evaluations of different generators. However, the execution of such empirical evaluations is not trivial and requires a substantial effort to collect benchmarks, setup the evaluation infrastructure, and collect and analyse the results. In this paper, we present our JUnit Generation benchmarking infrastructure (JUGE) supporting generators (e.g., search-based, random-based, symbolic execution, etc.) seeking to automate the production of unit tests for various purposes (e.g., validation, regression testing, fault localization, etc.). The primary goal is to reduce the overall effort, ease the comparison of several generators, and enhance the knowledge transfer between academia and industry by standardizing the evaluation and comparison process. Since 2013, eight editions of a unit testing tool competition, co-located with the Search-Based Software Testing Workshop, have taken place and used and updated JUGE. As a result, an increasing amount of tools (over ten) from both academia and industry have been evaluated on JUGE, matured over the years, and allowed the identification of future research directions.

SEMay 3, 2020

Pandemic Programming: How COVID-19 affects software developers and how their organizations can help

Paul Ralph, Sebastian Baltes, Gianisa Adisaputri et al.

Context. As a novel coronavirus swept the world in early 2020, thousands of software developers began working from home. Many did so on short notice, under difficult and stressful conditions. Objective. This study investigates the effects of the pandemic on developers' wellbeing and productivity. Method. A questionnaire survey was created mainly from existing, validated scales and translated into 12 languages. The data was analyzed using non-parametric inferential statistics and structural equation modeling. Results. The questionnaire received 2225 usable responses from 53 countries. Factor analysis supported the validity of the scales and the structural model achieved a good fit (CFI = 0.961, RMSEA = 0.051, SRMR = 0.067). Confirmatory results include: (1) the pandemic has had a negative effect on developers' wellbeing and productivity; (2) productivity and wellbeing are closely related; (3) disaster preparedness, fear related to the pandemic and home office ergonomics all affect wellbeing or productivity. Exploratory analysis suggests that: (1) women, parents and people with disabilities may be disproportionately affected; (2) different people need different kinds of support. Conclusions. To improve employee productivity, software companies should focus on maximizing employee wellbeing and improving the ergonomics of employees' home offices. Women, parents and disabled persons may require extra support.

SEMar 21, 2014

State Machine Flattening: Mapping Study and Assessment

Xavier Devroey, Gilles Perrouin, Maxime Cordy et al.

State machine formalisms equipped with hierarchy and parallelism allow to compactly model complex system behaviours. Such models can then be transformed into executable code or inputs for model-based testing and verification techniques. Generated artifacts are mostly flat descriptions of system behaviour. \emph{Flattening} is thus an essential step of these transformations. To assess the importance of flattening, we have defined and applied a systematic mapping process and 30 publications were finally selected. However, it appeared that flattening is rarely the sole focus of the publications and that care devoted to the description and validation of flattening techniques varies greatly. Preliminary assessment of associated tool support indicated limited tool availability and scalability on challenging models. We see this initial investigation as a first step towards generic flattening techniques and scalable tool support, cornerstones of reliable model-based behavioural development.

SEOct 9, 2013

Towards Statistical Prioritization for Software Product Lines Testing

Xavier Devroey, Maxime Cordy, Gilles Perrouin et al.

Software Product Lines (SPL) are inherently difficult to test due to the combinatorial explosion of the number of products to consider. To reduce the number of products to test, sampling techniques such as combinatorial interaction testing have been proposed. They usually start from a feature model and apply a coverage criterion (e.g. pairwise feature interaction or dissimilarity) to generate tractable, fault-finding, lists of configurations to be tested. Prioritization can also be used to sort/generate such lists, optimizing coverage criteria or weights assigned to features. However, current sampling/prioritization techniques barely take product behavior into account. We explore how ideas of statistical testing, based on a usage model (a Markov chain), can be used to extract configurations of interest according to the likelihood of their executions. These executions are gathered in featured transition systems, compact representation of SPL behavior. We discuss possible scenarios and give a prioritization procedure illustrated on an example.