Barbara Kitchenham

h-index: 73

Papers: 3

Citations: 67

Novelty: 18%

AI Score: 32

Ranked #126,112 of 194,257 authors (top 65%); #1,391 in SE (top 46%)

3 Papers

3.4 | SE | Nov 16, 2025

LLM4SCREENLIT: Recommendations on Assessing the Performance of Large Language Models for Screening Literature in Systematic Reviews

Lech Madeyski, Barbara Kitchenham, Martin Shepperd

Context: Large language models (LLMs) are released faster than users can evaluate them rigorously. When LLMs underpin research, such as identifying relevant literature for systematic reviews (SRs), robust empirical assessment is essential. Objective: We identify and discuss key challenges in assessing LLM performance for selecting relevant literature, identify good (evaluation) practices, and propose recommendations. Method: Using a recent large-scale study as an example, we identify problems with the use of traditional metrics for assessing the performance of Gen-AI tools for identifying relevant literature in SRs. We analyzed 27 additional papers investigating this issue, extracted the performance metrics, and found both good practices and widespread problems, especially with the use and reporting of performance metrics for SR screening. Results: Major weaknesses included: i) a failure to use metrics that are robust to imbalanced data and that directly indicate whether results are better than chance (e.g., relying on Accuracy), ii) a failure to consider the impact of lost evidence when making claims concerning workload savings, and iii) a pervasive failure to report the full confusion matrix (or performance metrics from which it can be reconstructed), which is essential for future meta-analyses. On the positive side, we extract good (evaluation) practices on which our recommendations for researchers and practitioners, as well as policymakers, are built. Conclusions: SR screening evaluations should prioritize lost evidence/recall alongside the chance-anchored and cost-sensitive Weighted MCC (WMCC) metric, report complete confusion matrices, treat unclassifiable outputs as referred-back positives for assessment, adopt leakage-aware designs with non-LLM baselines and open artifacts, and ground conclusions in cost-benefit analysis where false negatives (FNs) carry higher penalties than false positives (FPs).
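The abstract's point about Accuracy on imbalanced screening data can be illustrated with a minimal sketch. It computes Accuracy, recall, and the standard (unweighted) Matthews correlation coefficient from a full confusion matrix. The function name `screening_metrics` and the confusion-matrix counts are hypothetical and chosen only for illustration; the paper's Weighted MCC (WMCC) is not defined in this listing, so it is not reproduced here.

```python
import math


def screening_metrics(tp, fp, fn, tn):
    """Return Accuracy, recall, and standard MCC for one confusion matrix.

    Hypothetical helper for illustration; it implements the textbook
    (unweighted) MCC, not the WMCC metric named in the abstract.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Recall: share of truly relevant papers that the screener kept.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By common convention, MCC is set to 0 when a marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "mcc": mcc}


# Assumed scenario: 1,000 candidate papers, 50 of them relevant.
# A screener that excludes everything finds no relevant paper at all.
exclude_all = screening_metrics(tp=0, fp=0, fn=50, tn=950)

# A screener that recovers 40 of the 50 relevant papers at the cost of
# 100 false positives.
useful = screening_metrics(tp=40, fp=100, fn=10, tn=850)

print(exclude_all)
print(useful)
```

Under these assumed counts, the exclude-everything screener has the higher Accuracy even though it loses all the evidence, while recall and MCC both rank the second screener higher. Reporting all four cells, as the abstract recommends, lets readers recompute whichever metric they need.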

5.9 | SE | Feb 7, 2022

A longitudinal case study on the effects of an evidence-based software engineering training

Sebastián Pizard, Diego Vallespir, Barbara Kitchenham

Context: Evidence-based software engineering (EBSE) can be an effective resource to bridge the gap between academia and industry by balancing research of practical relevance and academic rigor. To achieve this, it seems necessary to investigate EBSE training and its benefits for practice. Objective: We sought both to develop an EBSE training course for university students and to investigate what effects it has on the attitudes and behaviors of the trainees. Method: We conducted a longitudinal case study of our EBSE course and its effects. For this, we collected data at the end of each EBSE course (2017, 2018, and 2019) and in two follow-up surveys (one 7 months after the last course finished, and a second after 21 months). Results: Our EBSE courses seem to have taught students adequately and consistently. Half of the respondents to the surveys report making use of the new skills from the course. The most-reported effects in both surveys indicated that EBSE concepts increase awareness of the value of research and evidence, and that EBSE methods improve information-gathering skills. Conclusions: As suggested by research in other areas, training appears to play a key role in the adoption of evidence-based practice. Our results indicate that our training method provides an introduction to EBSE suitable for undergraduates. However, we believe it is necessary to continue investigating EBSE training and its impact on software engineering practice.

30.5 | SE | Oct 7, 2020 | Code

Empirical Standards for Software Engineering Research

Paul Ralph, Nauman bin Ali, Sebastian Baltes et al.

Empirical Standards are natural-language models of a scientific community's expectations for a specific kind of study (e.g. a questionnaire survey). The ACM SIGSOFT Paper and Peer Review Quality Initiative generated empirical standards for research methods commonly used in software engineering. These living documents, which should be continuously revised to reflect evolving consensus around research best practices, will improve research quality and make peer review more effective, reliable, transparent and fair.