SEOct 26, 2022
Piloting Copilot, Codex, and StarCoder2: Hot Temperature, Cold Prompts, or Black Magic?Jean-Baptiste Döderlein, Nguessan Hermann Kouadio, Mathieu Acher et al.
Language models are promising solutions for tackling increasing complex problems. In software engineering, they recently gained attention in code assistants, which generate programs from a natural language task description (prompt). They have the potential to save time and effort but remain poorly understood, limiting their optimal use. In this article, we investigate the impact of input variations on two configurations of a language model, focusing on parameters such as task description, surrounding context, model creativity, and the number of generated solutions. We design specific operators to modify these inputs and apply them to three LLM-based code assistants (Copilot, Codex, StarCoder2) and two benchmarks representing algorithmic problems (HumanEval, LeetCode). Our study examines whether these variations significantly affect program quality and how these effects generalize across models. Our results show that varying input parameters can greatly improve performance, achieving up to 79.27% success in one-shot generation compared to 22.44% for Codex and 31.1% for Copilot in default settings. Actioning this potential in practice is challenging due to the complex interplay in our study - the optimal settings for temperature, prompt, and number of generated solutions vary by problem. Reproducing our study with StarCoder2 confirms these findings, indicating they are not model-specific. We also uncover surprising behaviors (e.g., fully removing the prompt can be effective), revealing model brittleness and areas for improvement.
18.0SEMar 19
SpaceTime Programming: Live and Omniscient Exploration of Code and ExecutionJean-Baptiste Döderlein, Djamel Eddine Khelladi, Mathieu Acher et al.
Programming environments typically separate the world of static code from the dynamic execution of programs. Developers must switch between writing code and observing its execution, often with limited tools to understand the relationship between code changes and runtime behavior. Several paradigms and approaches exist to bridge this gap, including exploratory programming for comparing code variants, live programming for immediate feedback, and omniscient debugging for exploring execution history. However, existing solutions tend to focus on specific aspects and one specific paradigm rather than providing a fully integrated environment with multiple capabilities. This paper introduces \spacetime Programming, a novel approach that unifies these paradigms to create a programming model for exploring both code modifications and execution flow. At the core of our approach is a trace mechanism that captures not only execution state but also the corresponding code changes, enabling developers to explore programs in both space (code variants) and time (execution flow). As a proof of concept, we implemented a Python library supporting SpaceTime Programming and applied it in two contexts: a live omniscient debugger and a Pygame game development tool, showcased through a Flappy Bird-like game. We further evaluated SpaceTimePy on five real-world Python projects, finding performance overhead ranging from 35% to 150% on test suites.
4.1SEApr 17
Small Yet Configurable: Unveiling Null Variability in SoftwareXhevahire Tërnava, Georges Aaron Randrianaina, Luc Lesoil et al.
Many small-scale software systems, that is, with limited codebase or binary size, are widely used in everyday tasks, yet their configurability remains largely unexplored. At the same time, studies on modern software systems show a trend toward increasing configurability, alongside growing interest in building immutable, specialized, and reproducible software. In this paper, we present the first empirical study on the extent of configurability in small-scale software systems. By analyzing 108 programs from GNU coreutils, we show that even small programs can exhibit significant compile-time and run-time variability, with up to 76 options per program. Then, there is a high correlation (0.78) between run-time variability and codebase size. Furthermore, an analysis of the 20 smallest programs across 85 releases reveals that variability tends to increase over time, primarily due to the added compile-time variability. This suggests that shifting options between run-time and compile-time, removing unnecessary run-time variability, or resolving compile-time variability early, can help reduce codebase complexity and size. We also introduce, for the first time, the concept of null-variable software system, one with no configurability beyond mandatory features. Our findings show that high configurability is not exclusive to large-scale systems and that reducing unnecessary variability can lead to lightweight, smaller, and more maintainable software. We hope this effort contributes to designing new software by understanding how to balance its configurability with codebase size.
SEOct 22, 2017Code
Test them all, is it worth it? Assessing configuration sampling on the JHipster Web development stackAxel Halin, Alexandre Nuttinck, Mathieu Acher et al.
Many approaches for testing configurable software systems start from the same assumption: it is impossible to test all configurations. This motivated the definition of variability-aware abstractions and sampling techniques to cope with large configuration spaces. Yet, there is no theoretical barrier that prevents the exhaustive testing of all configurations by simply enumerating them, if the effort required to do so remains acceptable. Not only this: we believe there is lots to be learned by systematically and exhaustively testing a configurable system. In this case study, we report on the first ever endeavour to test all possible configurations of an industry-strength, open source configurable software system, JHipster, a popular code generator for web applications. We built a testing scaffold for the 26,000+ configurations of JHipster using a cluster of 80 machines during 4 nights for a total of 4,376 hours (182 days) CPU time. We find that 35.70% configurations fail and we identify the feature interactions that cause the errors. We show that sampling strategies (like dissimilarity and 2-wise): (1) are more effective to find faults than the 12 default configurations used in the JHipster continuous integration; (2) can be too costly and exceed the available testing budget. We cross this quantitative analysis with the qualitative assessment of JHipster's lead developers.
SEJul 13, 2025
Prompting for Performance: Exploring LLMs for Configuring SoftwareHelge Spieker, Théo Matricon, Nassim Belmecheri et al.
Software systems usually provide numerous configuration options that can affect performance metrics such as execution time, memory usage, binary size, or bitrate. On the one hand, making informed decisions is challenging and requires domain expertise in options and their combinations. On the other hand, machine learning techniques can search vast configuration spaces, but with a high computational cost, since concrete executions of numerous configurations are required. In this exploratory study, we investigate whether large language models (LLMs) can assist in performance-oriented software configuration through prompts. We evaluate several LLMs on tasks including identifying relevant options, ranking configurations, and recommending performant configurations across various configurable systems, such as compilers, video encoders, and SAT solvers. Our preliminary results reveal both positive abilities and notable limitations: depending on the task and systems, LLMs can well align with expert knowledge, whereas hallucinations or superficial reasoning can emerge in other cases. These findings represent a first step toward systematic evaluations and the design of LLM-based solutions to assist with software configuration.
SEMay 12, 2025
Linux Kernel Configurations at Scale: A Dataset for Performance and Evolution AnalysisHeraldo Borges, Juliana Alves Pereira, Djamel Eddine Khelladi et al.
Configuring the Linux kernel to meet specific requirements, such as binary size, is highly challenging due to its immense complexity-with over 15,000 interdependent options evolving rapidly across different versions. Although several studies have explored sampling strategies and machine learning methods to understand and predict the impact of configuration options, the literature still lacks a comprehensive and large-scale dataset encompassing multiple kernel versions along with detailed quantitative measurements. To bridge this gap, we introduce LinuxData, an accessible collection of kernel configurations spanning several kernel releases, specifically from versions 4.13 to 5.8. This dataset, gathered through automated tools and build processes, comprises over 240,000 kernel configurations systematically labeled with compilation outcomes and binary sizes. By providing detailed records of configuration evolution and capturing the intricate interplay among kernel options, our dataset enables innovative research in feature subset selection, prediction models based on machine learning, and transfer learning across kernel versions. Throughout this paper, we describe how the dataset has been made easily accessible via OpenML and illustrate how it can be leveraged using only a few lines of Python code to evaluate AI-based techniques, such as supervised machine learning. We anticipate that this dataset will significantly enhance reproducibility and foster new insights into configuration-space analysis at a scale that presents unique opportunities and inherent challenges, thereby advancing our understanding of the Linux kernel's configurability and evolution.
SEMay 7, 2025
LLM Code Customization with Visual Results: A Benchmark on TikZCharly Reux, Mathieu Acher, Djamel Eddine Khelladi et al.
With the rise of AI-based code generation, customizing existing code out of natural language instructions to modify visual results -such as figures or images -has become possible, promising to reduce the need for deep programming expertise. However, even experienced developers can struggle with this task, as it requires identifying relevant code regions (feature location), generating valid code variants, and ensuring the modifications reliably align with user intent. In this paper, we introduce vTikZ, the first benchmark designed to evaluate the ability of Large Language Models (LLMs) to customize code while preserving coherent visual outcomes. Our benchmark consists of carefully curated vTikZ editing scenarios, parameterized ground truths, and a reviewing tool that leverages visual feedback to assess correctness. Empirical evaluation with stateof-the-art LLMs shows that existing solutions struggle to reliably modify code in alignment with visual intent, highlighting a gap in current AI-assisted code editing approaches. We argue that vTikZ opens new research directions for integrating LLMs with visual feedback mechanisms to improve code customization tasks in various domains beyond TikZ, including image processing, art creation, Web design, and 3D modeling.
SEMar 20, 2025
Unify and Triumph: Polyglot, Diverse, and Self-Consistent Generation of Unit Tests with LLMsDjamel Eddine Khelladi, Charly Reux, Mathieu Acher
Large language model (LLM)-based test generation has gained attention in software engineering, yet most studies evaluate LLMs' ability to generate unit tests in a single attempt for a given language, missing the opportunity to leverage LLM diversity for more robust testing. This paper introduces PolyTest, a novel approach that enhances test generation by exploiting polyglot and temperature-controlled diversity. PolyTest systematically leverages these properties in two complementary ways: (1) Cross-lingual test generation, where tests are generated in multiple languages at zero temperature and then unified; (2) Diverse test sampling, where multiple test sets are generated within the same language at a higher temperature before unification. A key insight is that LLMs can generate diverse yet contradicting tests -- same input, different expected outputs -- across languages and generations. PolyTest mitigates inconsistencies by unifying test sets, fostering self-consistency and improving overall test quality. Unlike single-language or single-attempt approaches, PolyTest enhances testing without requiring on-the-fly execution, making it particularly beneficial for weaker-performing languages. We evaluate PolyTest on Llama3-70B, GPT-4o, and GPT-3.5 using EvalPlus, generating tests in five languages (Java, C, Python, JavaScript, and a CSV-based format) at temperature 0 and sampling multiple sets at temperature 1. We observe that LLMs frequently generate contradicting tests across settings, and that PolyTest significantly improves test quality across all considered metrics -- number of tests, passing rate, statement/branch coverage (up to +9.01%), and mutation score (up to +11.23%). Finally, PolyTest outperforms Pynguin in test generation, passing rate, and mutation score.
SEDec 14, 2021
The Interaction between Inputs and Configurations fed to Software Systems: an Empirical StudyLuc Lesoil, Mathieu Acher, Arnaud Blouin et al.
Widely used software systems such as video encoders are by necessity highly configurable, with hundreds or even thousands of options to choose from. Their users often have a hard time finding suitable values for these options (i.e. finding a proper configuration of the software system) to meet their goals for the tasks at hand, e.g. compress a video down to a certain size. One dimension of the problem is of course that performance depends on the input data: a video as input to an encoder like x264 or a file system fed to a tool like xz. To achieve good performance, users should therefore take into account both dimensions of (1) software variability and (2) input data. In this problem-statement paper, we conduct a large study over 8 configurable systems that quantifies the existing interactions between input data and configurations of software systems. The results exhibit that (1) inputs fed to software systems interact with their configuration options in non monotonous ways, significantly impacting their performance properties (2) tuning a software system for its input data makes it possible to multiply its performance by up to ten (3) input variability can jeopardize the relevance of performance predictive models for a field deployment.
SESep 16, 2019
Towards Quality Assurance of Software Product Lines with Adversarial ConfigurationsPaul Temple, Mathieu Acher, Gilles Perrouin et al.
Software product line (SPL) engineers put a lot of effort to ensure that, through the setting of a large number of possible configuration options, products are acceptable and well-tailored to customers' needs. Unfortunately, options and their mutual interactions create a huge configuration space which is intractable to exhaustively explore. Instead of testing all products, machine learning techniques are increasingly employed to approximate the set of acceptable products out of a small training sample of configurations. Machine learning (ML) techniques can refine a software product line through learned constraints and a priori prevent non-acceptable products to be derived. In this paper, we use adversarial ML techniques to generate adversarial configurations fooling ML classifiers and pinpoint incorrect classifications of products (videos) derived from an industrial video generator. Our attacks yield (up to) a 100% misclassification rate and a drop in accuracy of 5%. We discuss the implications these results have on SPL quality assurance.
SEJun 7, 2019
Learning Software Configuration Spaces: A Systematic Literature ReviewJuliana Alves Pereira, Hugo Martin, Mathieu Acher et al.
Most modern software systems (operating systems like Linux or Android, Web browsers like Firefox or Chrome, video encoders like ffmpeg, x264 or VLC, mobile and cloud applications, etc.) are highly-configurable. Hundreds of configuration options, features, or plugins can be combined, each potentially with distinct functionality and effects on execution time, security, energy consumption, etc. Due to the combinatorial explosion and the cost of executing software, it is quickly impossible to exhaustively explore the whole configuration space. Hence, numerous works have investigated the idea of learning it from a small sample of configurations' measurements. The pattern "sampling, measuring, learning" has emerged in the literature, with several practical interests for both software developers and end-users of configurable systems. In this survey, we report on the different application objectives (e.g., performance prediction, configuration optimization, constraint mining), use-cases, targeted software systems and application domains. We review the various strategies employed to gather a representative and cost-effective sample. We describe automated software techniques used to measure functional and non-functional properties of configurations. We classify machine learning algorithms and how they relate to the pursued application. Finally, we also describe how researchers evaluate the quality of the learning process. The findings from this systematic review show that the potential application objective is important; there are a vast number of case studies reported in the literature from the basis of several domains and software systems. Yet, the huge variant space of configurable systems is still challenging and calls to further investigate the synergies between artificial intelligence and software engineering.
LGMay 30, 2018
Towards Adversarial Configurations for Software Product LinesPaul Temple, Mathieu Acher, Battista Biggio et al.
Ensuring that all supposedly valid configurations of a software product line (SPL) lead to well-formed and acceptable products is challenging since it is most of the time impractical to enumerate and test all individual products of an SPL. Machine learning classifiers have been recently used to predict the acceptability of products associated with unseen configurations. For some configurations, a tiny change in their feature values can make them pass from acceptable to non-acceptable regarding users' requirements and vice-versa. In this paper, we introduce the idea of leveraging these specific configurations and their positions in the feature space to improve the classifier and therefore the engineering of an SPL. Starting from a variability model, we propose to use Adversarial Machine Learning techniques to create new, adversarial configurations out of already known configurations by modifying their feature values. Using an industrial video generator we show how adversarial configurations can improve not only the classifier, but also the variability model, the variability implementation, and the testing oracle.
AIApr 28, 2016
Large-scale Analysis of Chess Games with Chess Engines: A Preliminary ReportMathieu Acher, François Esnault
The strength of chess engines together with the availability of numerous chess games have attracted the attention of chess players, data scientists, and researchers during the last decades. State-of-the-art engines now provide an authoritative judgement that can be used in many applications like cheating detection, intrinsic ratings computation, skill assessment, or the study of human decision-making. A key issue for the research community is to gather a large dataset of chess games together with the judgement of chess engines. Unfortunately the analysis of each move takes lots of times. In this paper, we report our effort to analyse almost 5 millions chess games with a computing grid. During summer 2015, we processed 270 millions unique played positions using the Stockfish engine with a quite high depth (20). We populated a database of 1+ tera-octets of chess evaluations, representing an estimated time of 50 years of computation on a single machine. Our effort is a first step towards the replication of research results, the supply of open data and procedures for exploring new directions, and the investigation of software engineering/scalability issues when computing billions of moves.
SEFeb 16, 2015
Synthesis of Attributed Feature Models From Product Descriptions: FoundationsGuillaume Bécan, Razieh Behjati, Arnaud Gotlieb et al.
Feature modeling is a widely used formalism to characterize a set of products (also called configurations). As a manual elaboration is a long and arduous task, numerous techniques have been proposed to reverse engineer feature models from various kinds of artefacts. But none of them synthesize feature attributes (or constraints over attributes) despite the practical relevance of attributes for documenting the different values across a range of products. In this report, we develop an algorithm for synthesizing attributed feature models given a set of product descriptions. We present sound, complete, and parametrizable techniques for computing all possible hierarchies, feature groups, placements of feature attributes, domain values, and constraints. We perform a complexity analysis w.r.t. number of features, attributes, configurations, and domain size. We also evaluate the scalability of our synthesis procedure using randomized configuration matrices. This report is a first step that aims to describe the foundations for synthesizing attributed feature models.