SEJan 16, 2023Code
Learning from Very Little Data: On the Value of Landscape Analysis for Predicting Software Project HealthAndre Lustosa, Tim Menzies
When data is scarce, software analytics can make many mistakes. For example, consider learning predictors for open source project health (e.g. the number of closed pull requests in twelve months time). The training data for this task may be very small (e.g. five years of data, collected every month means just 60 rows of training data). The models generated from such tiny data sets can make many prediction errors. Those errors can be tamed by a {\em landscape analysis} that selects better learner control parameters. Our niSNEAK tool (a)~clusters the data to find the general landscape of the hyperparameters; then (b)~explores a few representatives from each part of that landscape. niSNEAK is both faster and more effective than prior state-of-the-art hyperparameter optimization algorithms (e.g. FLASH, HYPEROPT, OPTUNA). The configurations found by niSNEAK have far less error than other methods. For example, for project health indicators such as $C$= number of commits; $I$=number of closed issues, and $R$=number of closed pull requests, niSNEAK's 12 month prediction errors are \{I=0\%, R=33\%\,C=47\%\} Based on the above, we recommend landscape analytics (e.g. niSNEAK) especially when learning from very small data sets. This paper only explores the application of niSNEAK to project health. That said, we see nothing in principle that prevents the application of this technique to a wider range of problems. To assist other researchers in repeating, improving, or even refuting our results, all our scripts and data are available on GitHub at https://github.com/zxcv123456qwe/niSneak
SEOct 6, 2021Code
SNEAK: Faster Interactive Search-based SEAndre Lustosa, Jaydeep Patel, Venkata Sai Teja Malapati et al.
When AI tools can generate many solutions, some human preference must be applied to determine which solution is relevant to the current project. One way to find those preferences is interactive search-based software engineering (iSBSE) where humans can influence the search process. This paper argues that when optimizing a model using human-in-the-loop, data mining methods such as our SNEAK tool (that recurses into divisions of the data) perform better than standard iSBSE methods (that mutates multiple candidate solutions over many generations). For our case studies, SNEAK runs faster, asks fewer questions, achieves better solutions (that are within 3% of the best solutions seen in our sample space), and scales to large problems (in our experiments, models with 1000 variables can be explored with half a dozen interactions where, each time, we ask only four questions). Accordingly, we recommend SNEAK as a baseline against which future iSBSE work should be compared. To facilitate that, all our scripts are online at https://github.com/ai-se/sneak.
SEJun 7, 2021Code
Preference Discovery in Large Product LinesAndre Lustosa, Tim Menzies
When AI tools can generate many solutions, some human preference must be applied to determine which solution is relevant to the current project. One way to find those preferences is interactive search-based software engineering (iSBSE) where humans can influence the search process. Current iSBSE methods can lead to cognitive fatigue (when they overwhelm humans with too many overly elaborate questions). WHUN is an iSBSE algorithm that avoids that problem. Due to its recursive clustering procedure, WHUN only pesters humans for $O(log_2{N})$ interactions. Further, each interaction is mediated via a feature selection procedure that reduces the number of asked questions. When compared to prior state-of-the-art iSBSE systems, WHUN runs faster, asks fewer questions, and achieves better solutions that are within $0.1\%$ of the best solutions seen in our sample space. More importantly, WHUN scales to large problems (in our experiments, models with 1000 variables can be explored with half a dozen interactions where, each time, we ask only four questions). Accordingly, we recommend WHUN as a baseline against which future iSBSE work should be compared. To facilitate that, all our scripts are online at https://github.com/ai-se/whun.
SEJul 11, 2021
Fairer Software Made Easier (using "Keys")Tim Menzies, Kewen Peng, Andre Lustosa
Can we simplify explanations for software analytics? Maybe. Recent results show that systems often exhibit a "keys effect"; i.e. a few key features control the rest. Just to say the obvious, for systems controlled by a few keys, explanation and control is just a matter of running a handful of "what-if" queries across the keys. By exploiting the keys effect, it should be possible to dramatically simplify even complex explanations, such as those required for ethical AI systems.