Quantifying Ambiguity in Categorical Annotations: A Measure and Statistical Inference Framework
This work addresses the handling of ambiguous human annotations in machine learning and provides tools for uncertainty quantification. The contribution is incremental in that it builds on existing entropy measures.
The authors quantify ambiguity in categorical annotations with a new measure that distinguishes class-level indistinguishability from explicit unresolvability. They also develop statistical inference tools, namely frequentist point estimators and Bayesian posteriors, for practical applications such as dataset-quality assessment.
Human-generated categorical annotations frequently produce empirical response distributions (soft labels) that reflect ambiguity rather than simple annotator error. We introduce an ambiguity measure that maps a discrete response distribution to a scalar in the unit interval, designed to quantify aleatoric uncertainty in categorical tasks. The measure bears a close relationship to quadratic entropy (Gini-style impurity) but departs from those indices by treating an explicit "can't solve" category asymmetrically, thereby separating uncertainty arising from class-level indistinguishability from uncertainty due to explicit unresolvability. We analyze the measure's formal properties and contrast its behavior with a representative ambiguity measure from the literature. Moving beyond description, we develop statistical tools for inference: we propose frequentist point estimators for population ambiguity and derive the Bayesian posterior over ambiguity induced by Dirichlet priors on the underlying probability vector, providing a principled account of epistemic uncertainty. Numerical examples illustrate estimation, calibration, and practical use for dataset-quality assessment and downstream machine-learning workflows.
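The inference pipeline described above (a plug-in point estimate plus a Dirichlet-induced posterior over a scalar index) can be sketched in a few lines. The abstract does not state the proposed measure's formula, so the sketch below substitutes plain normalized quadratic entropy as a stand-in; it does not implement the asymmetric treatment of the "can't solve" category. The function names, the symmetric prior value of 1.0, and the example counts are illustrative assumptions rather than the authors' choices.

```python
import random

def normalized_gini(p):
    """Stand-in index (an assumption, NOT the paper's measure):
    quadratic entropy 1 - sum(p_i^2), rescaled to [0, 1] by its maximum 1 - 1/K."""
    k = len(p)
    return (1.0 - sum(x * x for x in p)) / (1.0 - 1.0 / k)

def plug_in_estimate(counts):
    """Frequentist plug-in estimate: apply the index to the empirical proportions."""
    n = sum(counts)
    return normalized_gini([c / n for c in counts])

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def posterior_samples(counts, prior=1.0, n_samples=5000, seed=0):
    """Monte Carlo draws of the index under the posterior Dirichlet(counts + prior).
    The symmetric prior value is a hypothetical default."""
    rng = random.Random(seed)
    alpha = [c + prior for c in counts]
    return [normalized_gini(sample_dirichlet(alpha, rng)) for _ in range(n_samples)]

# Hypothetical item: 10 annotators, 3 classes, votes split 6 / 3 / 1.
counts = [6, 3, 1]
point = plug_in_estimate(counts)
samples = sorted(posterior_samples(counts))
lo = samples[int(0.025 * len(samples))]
hi = samples[int(0.975 * len(samples)) - 1]
print(f"plug-in estimate: {point:.3f}")
print(f"approximate 95% credible interval: [{lo:.3f}, {hi:.3f}]")
```

For the example counts the plug-in estimate is 0.81. The width of the credible interval reflects epistemic uncertainty from having only ten annotations, and it should narrow as more responses per item are collected.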