LG, HC | Sep 29, 2021

PyHard: a novel tool for generating hardness embeddings to support data-centric analysis

arXiv:2109.14430v1 | 1 citation
Originality: Synthesis-oriented
AI Analysis

This tool supports data-centric analysis for ML practitioners by providing visual insights into dataset hardness and model performance, though it is incremental as it applies an existing methodology to new contexts.

The authors address how to assess dataset quality and model performance by developing PyHard, a tool that uses Instance Space Analysis to generate hardness embeddings. Applied to a COVID prognosis dataset, it helped identify challenging observations and delineate the strengths and weaknesses of the models.

For building successful Machine Learning (ML) systems, it is imperative to have high-quality data and well-tuned learning models. But how can one assess the quality of a given dataset? And how can the strengths and weaknesses of a model on a dataset be revealed? Our new tool PyHard employs a methodology known as Instance Space Analysis (ISA) to produce a hardness embedding of a dataset, relating the predictive performance of multiple ML models to estimated instance hardness meta-features. This space is built so that observations are distributed linearly according to how hard they are to classify. The user can visually interact with this embedding in multiple ways and obtain useful insights about data and algorithmic performance across the individual observations of the dataset. We show on a COVID prognosis dataset how this analysis supported the identification of pockets of hard observations that challenge ML models and are therefore worth closer inspection, and the delineation of regions of strengths and weaknesses of ML models.
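
To make the idea of an "instance hardness meta-feature" concrete, here is a minimal sketch of one measure from the instance hardness literature, k-Disagreeing Neighbors (kDN): the fraction of an observation's k nearest neighbors that carry a different class label. It is an illustration only, not PyHard's API or implementation; the function name `k_disagreeing_neighbors` and the toy data are assumptions made for this example.

```python
import math


def k_disagreeing_neighbors(X, y, k=3):
    """For each instance, return the fraction of its k nearest
    neighbors (Euclidean distance) whose label differs from its own.

    Hypothetical helper for illustration; not part of PyHard.
    """
    scores = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Distances from instance i to every other instance,
        # sorted ascending (ties broken by index).
        dists = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )
        neighbors = [j for _, j in dists[:k]]
        disagree = sum(1 for j in neighbors if y[j] != yi)
        scores.append(disagree / k)
    return scores


# Toy data (made up): two well-separated clusters, plus one
# class-1 point placed in the middle of the class-0 cluster.
X = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),  # class 0
    (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0),  # class 1
    (0.5, 0.5),                                      # class 1, misplaced
]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

hardness = k_disagreeing_neighbors(X, y, k=3)
for point, label, h in zip(X, y, hardness):
    print(point, label, round(h, 2))
```

In this toy set the misplaced point receives the highest score, its class-0 neighbors receive a moderate score, and the clean class-1 cluster scores zero. A tool in the spirit of the abstract would combine several such meta-features with per-instance model performance before projecting to 2D; that projection step is not shown here.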

Code Implementations: 1 repo
