LG | Dec 4, 2022

Characterizing instance hardness in classification and regression problems

arXiv:2212.01897v1 | 2 citations | h-index: 9
Originality: Synthesis-oriented
AI Analysis

This work aims to help machine learning practitioners improve data quality and learning strategies by identifying hard-to-predict instances. It appears incremental, since it builds on existing concepts of instance hardness.

The paper tackles the problem of identifying and characterizing which instances in a dataset are hardest to predict accurately, proposing a set of meta-features for instance hardness measures in both classification and regression. It analyzes synthetic datasets with varying complexity levels and provides a Python package for implementation.

Recent work in the Machine Learning (ML) literature has demonstrated the usefulness of assessing which observations are hardest to label accurately. By identifying such instances, one may inspect whether they have any quality issues that should be addressed. Learning strategies based on the difficulty level of the observations can also be devised. This paper presents a set of meta-features, also known as instance hardness measures, that aim to characterize which instances of a dataset are hardest to predict accurately and why. Both classification and regression problems are considered. Synthetic datasets with different levels of complexity are built and analyzed. A Python package containing all implementations is also provided.
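
To make the idea of an instance hardness measure concrete, here is a minimal sketch of one measure from the wider literature, k-Disagreeing Neighbors (kDN): the fraction of a point's k nearest neighbours that carry a different label. Whether this matches any measure in the paper's own set is an assumption, and the function name `k_disagreeing_neighbors` and the toy data are hypothetical. It does not use the paper's Python package.

```python
import math


def k_disagreeing_neighbors(X, y, k=3):
    """Return, for each point, the fraction of its k nearest neighbours
    (Euclidean distance) whose label differs from its own.

    Illustrative helper only; not taken from the paper's package.
    """
    n = len(X)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(X)")
    scores = []
    for i in range(n):
        # Distances from point i to every other point, nearest first.
        dists = sorted((math.dist(X[i], X[j]), j) for j in range(n) if j != i)
        neighbours = [j for _, j in dists[:k]]
        disagree = sum(1 for j in neighbours if y[j] != y[i])
        scores.append(disagree / k)
    return scores


# Toy data (made up): two well-separated clusters, with one point in the
# first cluster labelled against its neighbours.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
y = [0, 0, 0, 1, 1, 1, 1, 1]

hardness = k_disagreeing_neighbors(X, y, k=3)
hardest = max(range(len(X)), key=lambda i: hardness[i])
```

In this toy setup the point at index 3 is surrounded only by points of the other class, so it receives the highest score. A high score flags an instance worth inspecting for label noise or class overlap, which is the kind of data-quality check the abstract describes.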

