52.9 · SE · Mar 30 · Code
How Low Can You Go? The Data-Light SE Challenge
Kishan Kumar Ganguly, Tim Menzies
Much of Software Engineering (SE) research assumes that progress depends on massive datasets and CPU-intensive optimizers. Yet has this assumption been rigorously tested? The counter-evidence presented in this paper suggests otherwise. For over 100 optimization tasks from recent SE papers (including software configuration, performance tuning, product line engineering, project health forecasting, defect prediction, software testing, software process and cost estimation, and cross-domain generalization datasets), even with just a few dozen labels, very simple methods (e.g., diversity sampling, a minimal Bayesian learner, its distance-based non-parametric variant, or random probes) achieve over 90% of the best reported results. Furthermore, these simple methods perform just as well as more complex state-of-the-art optimizers such as SMAC, TPE, and DEHB. While some tasks would require better outcomes and more sampling, the results seen after a few dozen samples would suffice for many engineering needs (particularly when the goal is rapid and cost-efficient guidance rather than slow and exhaustive optimization). To say that another way, at least some SE tasks are better served by lightweight approaches that demand fewer labels and far less computation. We hence propose the data-light challenge: when will a handful of labels suffice for SE tasks? To enable a large-scale investigation of this issue, we contribute (1) a mathematical formalization of labeling, (2) lightweight baseline algorithms, and (3) results on public-domain data showing the conditions under which lightweight methods excel or fail. For the purposes of open science, our scripts and data are online at https://github.com/KKGanguly/NEO.
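To make the "random probes" idea concrete, here is a minimal sketch of a label-budgeted baseline. It is not the authors' implementation (see the repository above for that). The names `random_probe` and `toy_label`, the budget of 32, and the toy objective are all assumptions made for illustration.

```python
import random

def random_probe(candidates, label, budget=32, seed=0):
    """Label only `budget` randomly chosen candidates and return the best one.

    `label` stands in for the expensive oracle (lower scores are better).
    All names here are hypothetical, chosen for this illustration only.
    """
    rng = random.Random(seed)
    probes = rng.sample(candidates, min(budget, len(candidates)))
    return min(probes, key=label)

# Hypothetical toy task: 10,000 two-option configurations.
# The score is the squared distance to an assumed ideal setting (0.5, 0.5).
_rng = random.Random(1)
candidates = [(_rng.random(), _rng.random()) for _ in range(10_000)]

def toy_label(c):
    return (c[0] - 0.5) ** 2 + (c[1] - 0.5) ** 2

best = random_probe(candidates, toy_label, budget=32)
print("best of 32 probes:", best, "score:", toy_label(best))
```

Only 32 of the 10,000 candidates are ever scored, which is the sense in which such a method is "data-light". How close the result gets to the true optimum depends on the task.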
SE · Dec 19, 2025 · Code
From Coverage to Causes: Data-Centric Fuzzing for JavaScript Engines
Kishan Kumar Ganguly, Tim Menzies
Context: Exhaustive fuzzing of modern JavaScript engines is infeasible due to the vast number of program states and execution paths. Coverage-guided fuzzers waste effort on low-risk inputs, often ignoring vulnerability-triggering ones that do not increase coverage. Existing heuristics proposed to mitigate this require expert effort, are brittle, and are hard to adapt. Objective: We propose a data-centric, LLM-boosted alternative that learns from historical vulnerabilities to automatically identify minimal static (code) and dynamic (runtime) features for detecting high-risk inputs. Method: Guided by historical V8 bugs, iterative prompting generated 115 static and 49 dynamic features, with the latter requiring only five trace flags, minimizing instrumentation cost. After feature selection, 41 features remained to train an XGBoost model to predict high-risk inputs during fuzzing. Results: Combining static and dynamic features yields over 85% precision and under 1% false alarms. Only 25% of these features are needed for comparable performance, showing that most of the search space is irrelevant. Conclusion: This work introduces feature-guided fuzzing, an automated data-driven approach that replaces coverage with data-directed inference, guiding fuzzers toward high-risk states for faster, targeted, and reproducible vulnerability discovery. To support open science, all scripts and data are available at https://github.com/KKGanguly/DataCentricFuzzJS.
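For readers unfamiliar with the two reported measures, the sketch below computes them using the standard definitions: precision = TP / (TP + FP) and false-alarm rate = FP / (FP + TN). The function name and the toy labels are assumptions for illustration. They are not taken from the paper's scripts or data.

```python
def precision_and_false_alarm(actual, predicted):
    """Return (precision, false-alarm rate) for binary labels.

    1 means "high-risk input", 0 means "low-risk input".
    Standard definitions are assumed; names are illustrative only.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    false_alarm = fp / (fp + tn) if (fp + tn) else 0.0
    return precision, false_alarm

# Hypothetical toy labels: 3 risky inputs and 5 safe ones.
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
precision, false_alarm = precision_and_false_alarm(actual, predicted)
print(f"precision={precision:.2f} false_alarm={false_alarm:.2f}")
```

A low false-alarm rate matters in fuzzing because safe inputs vastly outnumber risky ones, so even a small rate can mean many wasted executions.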
37.8 · SE · May 10
Zoom, Don't Wander: Why Regional Search Outperforms Pareto Reasoning and Global Optimization in Budget-Constrained SBSE
Kishan Kumar Ganguly, Tim Menzies
Traditional Search-Based Software Engineering (SBSE) assumes global search and full Pareto exploration are essential. We offer the following negative result based on a study of over 100 Software Engineering (SE) optimization tasks: "zooming" into promising regions is far more effective than Pareto and global exploration under constrained evaluation budgets. Our minimal greedy zoom method, EZR, runs three orders of magnitude faster than Pareto and global Bayesian methods, achieving higher statistical ranks and winning or tying in 84-89% of datasets on equal budget. Even at one-fifth the evaluation budget, EZR wins or ties in 79-81% of datasets. Surprisingly, despite never explicitly seeking a frontier, EZR matches or outperforms Pareto methods on their own coverage metrics (IGD, HV) at equal budgets. The explanation for this widespread failure is structural: across the datasets studied, Pareto-optimal solutions form a tiny, tight island concentrated in a compact region of decision space. Methods that wander waste their budgets outside this island. Beyond efficiency, zooming yields small, interpretable models, thus addressing concerns about black-box AI. By replacing global wandering with greedy zooming, we make SBSE much faster, more explicable, and hence accessible to a wider audience. SBSE practitioners and researchers should zoom, not wander.
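The general idea of "zooming" under a fixed evaluation budget can be sketched as follows. This is not EZR, whose details are not given in the abstract. It is a hypothetical nearest-to-best greedy loop, and the names `greedy_zoom` and `toy_label`, the budget of 32, and the warm-up of 4 random labels are assumptions.

```python
import random

def greedy_zoom(candidates, label, budget=32, warm=4, seed=0):
    """Hypothetical greedy zoom: spend the label budget near the best point so far.

    Returns (best candidate, number of labels used). Lower labels are better.
    """
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)
    # Warm start: label a few random candidates.
    labelled = [(label(c), c) for c in pool[:warm]]
    pool = pool[warm:]
    while pool and len(labelled) < budget:
        _, best = min(labelled)
        # Zoom: label the unlabelled candidate closest to the current best.
        nxt = min(pool, key=lambda c: sum((a - b) ** 2 for a, b in zip(c, best)))
        pool.remove(nxt)
        labelled.append((label(nxt), nxt))
    return min(labelled)[1], len(labelled)

# Hypothetical toy task: 2,000 two-option configurations, single objective.
_rng = random.Random(1)
candidates = [(_rng.random(), _rng.random()) for _ in range(2_000)]

def toy_label(c):
    return (c[0] - 0.5) ** 2 + (c[1] - 0.5) ** 2

best, used = greedy_zoom(candidates, toy_label, budget=32)
print("labels used:", used, "best:", best)
```

The loop never tries to map a full frontier. It only refines the neighbourhood of the best labelled point, which is the contrast with global "wandering" that the abstract draws.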
43.4 · SE · Mar 11
From Verification to Herding: Exploiting Software's Sparsity of Influence
Tim Menzies, Kishan Kumar Ganguly
Software verification is now costly, taking over half the project effort while failing on modern complex systems. We hence propose a shift from verification and modeling to herding: treating testing as a model-free search task that steers systems toward target goals. This exploits the "Sparsity of Influence": the fact that, often, large software state spaces are ruled by just a few variables. We introduce EZR (Efficient Zero-knowledge Ranker), a stochastic learner that finds these controllers directly. Across dozens of tasks, EZR achieved 90% of peak results with only 32 samples, replacing heavy solvers with light sampling.
IR · Dec 1, 2020
A Statistical Real-Time Prediction Model for Recommender System
Md Rifat Arefin, Minhas Kamal, Kishan Kumar Ganguly et al.
Recommender systems have become an inseparable part of online shopping, and their use is increasing as e-commerce sites advance. An effective and efficient recommender system benefits both the seller and the buyer significantly. We considered user activities and product information for the filtering process in our proposed recommender system. Our model achieved encouraging results (approximately 58% true-positive and 13% false-positive) on the data set provided by RecSys Challenge 2015. This paper describes a statistical model that helps predict the buying behavior of a user in real time during a session.
CY · Aug 26, 2020
Impact on the Productivity of Remotely Working IT Professionals of Bangladesh during the Coronavirus Disease 2019
Kishan Kumar Ganguly, Noshin Tahsin, Mridha Md. Nafis Fuad et al.
As in the rest of the world, the recent pandemic has forced the IT professionals of Bangladesh to adopt remote work. The aim of this study is to find out whether remote work can be continued even after the lockdown is lifted. As work from home (WFH) may change various productivity-related aspects of the employees, i.e., team dynamics and company dynamics, it is necessary to understand the nature of the change during WFH. Through a survey, we asked the IT professionals of Bangladesh how they perceive their level of productivity during WFH and how the factors related to productivity have changed. We analyzed the change and identified the areas affected by WFH. We found that resource- and workspace-related issues and the emotional well-being of employees were affected the most during WFH. We believe that the findings from this study will help to decide how to resolve those issues and to understand whether WFH can be continued even after the lockdown is lifted.