SE · May 5, 2019

Better Data Labelling with EMBLEM (and how that Impacts Defect Prediction)

arXiv:1905.01719v3 · 51 citations · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of efficient and accurate defect prediction in software engineering and offers a cost-effective solution for developers and maintainers. The advance is incremental, since it builds on existing AI methods combined with human input.

The paper tackles the problem of labelling problematic software development commits by introducing EMBLEM, a human-AI partnership that incrementally applies expertise to improve labelling accuracy. The result is a roughly 8x cost reduction and significant improvements in $P_{opt}20$ and G-scores in nearly all of the 9 open-source projects studied.

Standard automatic methods for recognizing problematic development commits can be greatly improved via the incremental application of human+artificial expertise. In this approach, called EMBLEM, an AI tool first explores the software development process to label the commits that are most problematic. Humans then apply their expertise to check those labels (perhaps resulting in the AI updating the support vectors within its SVM learner). We recommend this human+AI partnership for several reasons. When a new domain is encountered, EMBLEM can learn better ways to label which comments refer to real problems. Also, in studies with 9 open source software projects, labelling via EMBLEM's incremental application of human+AI expertise is close to an order of magnitude cheaper than existing methods ($\approx$ eight times). Further, EMBLEM is very effective. For the data sets explored here, EMBLEM's better labelling methods significantly improved $P_{opt}20$ and G-score performance in nearly all the projects studied.
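The review loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the learner here is a simple word-count scorer standing in for the SVM the abstract mentions, and every name (`train`, `score`, `review_loop`, `ask_human`, the seed keywords, the sample commit messages) is hypothetical. The sketch only shows the shape of the loop: the tool ranks unreviewed commits by how problematic they look, a human checks the top few, and the learner is rebuilt from the checked labels.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple


def train(labelled: Dict[int, bool], messages: List[str]) -> Counter:
    """Stand-in learner (NOT an SVM): each word gets +1 per human-confirmed
    problematic commit it appears in and -1 per commit confirmed as fine."""
    weights: Counter = Counter()
    for idx, is_problem in labelled.items():
        for word in set(messages[idx].lower().split()):
            weights[word] += 1 if is_problem else -1
    return weights


def score(weights: Counter, message: str) -> int:
    """Higher score means the commit looks more problematic."""
    return sum(weights[w] for w in set(message.lower().split()))


def review_loop(
    messages: List[str],
    ask_human: Callable[[str], bool],
    seed_words: Sequence[str] = ("bug", "fix", "error"),  # assumed keyword prior
    batch_size: int = 2,
    budget: int = 4,
) -> Tuple[Dict[int, bool], Counter]:
    """Incrementally label commits: rank, ask a human, retrain, repeat."""
    weights = Counter({w: 1 for w in seed_words})  # initial automatic guess
    labelled: Dict[int, bool] = {}
    limit = min(budget, len(messages))
    while len(labelled) < limit:
        unlabelled = [i for i in range(len(messages)) if i not in labelled]
        # Most suspicious commits first, according to the current model.
        unlabelled.sort(key=lambda i: score(weights, messages[i]), reverse=True)
        for i in unlabelled[:batch_size]:
            if len(labelled) >= limit:
                break
            labelled[i] = ask_human(messages[i])  # human checks the label
        weights = train(labelled, messages)  # update learner from feedback
    return labelled, weights


# Hypothetical commit messages and a simulated human reviewer.
messages = [
    "fix null pointer crash in parser",
    "update readme",
    "add feature flag for login",
    "fix typo in docs",
    "handle crash on empty input",
]
truth = {0: True, 1: False, 2: False, 3: False, 4: True}
labelled, weights = review_loop(messages, lambda m: truth[messages.index(m)])
```

In this toy run the seed keywords miss the last commit, but once the reviewer confirms the first commit as problematic and rejects the typo fix, the rebuilt scorer ranks "handle crash on empty input" first in the next round. One commit is never shown to the reviewer, which is where the labelling cost saving would come from.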

