LG HC, Sep 29, 2021

PyHard: a novel tool for generating hardness embeddings to support data-centric analysis

Pedro Yuri Arbs Paiva, Kate Smith-Miles, Maria Gabriela Valeriano, Ana Carolina Lorena

arXiv:2109.14430v13.11 citations

Originality: Synthesis-oriented

AI Analysis

This tool supports data-centric analysis for ML practitioners by providing visual insights into dataset hardness and model performance. Its contribution is incremental, since it applies an existing methodology to new contexts.

The authors address the problem of assessing dataset quality and model performance with PyHard, a tool that uses Instance Space Analysis to generate hardness embeddings. On a COVID prognosis dataset, the embedding helped identify challenging observations and the strengths and weaknesses of the models.

For building successful Machine Learning (ML) systems, it is imperative to have high-quality data and well-tuned learning models. But how can one assess the quality of a given dataset? And how can the strengths and weaknesses of a model on a dataset be revealed? Our new tool PyHard employs a methodology known as Instance Space Analysis (ISA) to produce a hardness embedding of a dataset, relating the predictive performance of multiple ML models to estimated instance hardness meta-features. This space is built so that observations are distributed linearly according to how hard they are to classify. The user can visually interact with this embedding in multiple ways and obtain useful insights about data and algorithmic performance at the level of the individual observations of the dataset. We show on a COVID prognosis dataset how this analysis supported the identification of pockets of hard observations that challenge ML models and are therefore worth closer inspection, and the delineation of regions of strengths and weaknesses of ML models.
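
To make the notion of an "instance hardness meta-feature" concrete, here is a minimal sketch of one measure from the instance hardness literature, k-Disagreeing Neighbors (kDN): the fraction of an observation's k nearest neighbours that carry a different class label. The abstract does not list which meta-features PyHard computes or how, so the function name `kdn_hardness`, the toy data, and the choice of k are illustrative assumptions and not PyHard's API.

```python
from math import dist

def kdn_hardness(points, labels, k=3):
    """Hypothetical helper: kDN score per instance.

    For each instance, return the fraction of its k nearest neighbours
    (Euclidean distance) whose label differs from its own.
    """
    scores = []
    for i, (p, y) in enumerate(zip(points, labels)):
        # Rank every other instance by distance to instance i, keep the k closest.
        neighbours = sorted(
            (dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        disagreeing = sum(labels[j] != y for _, j in neighbours)
        scores.append(disagreeing / k)
    return scores

# Toy data (assumed for illustration): two well-separated clusters, plus one
# observation labelled 1 that sits in the middle of the class-0 cluster.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
          (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]

hardness = kdn_hardness(points, labels, k=3)
```

In this toy set, the observation at index 4 is surrounded only by class-0 neighbours and therefore receives the maximum score, while the points in the clean class-1 cluster score zero. A tool in the spirit described above would combine several such meta-features with the per-instance performance of multiple classifiers before projecting to 2-D; that projection step is not shown here.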
