ML LG PR Aug 28, 2024

Analysis of Diagnostics (Part II): Prevalence, Linear Independence, and Unsupervised Learning

Paul N. Patrone, Raquel A. Binder, Catherine S. Forconi, Ann M. Moormann, Anthony J. Kearsley

arXiv:2408.16035v13.1

Originality: Incremental advance

AI Analysis

This work extends uncertainty quantification and classification theory to unsupervised learning, with researchers in machine learning and diagnostics as the intended audience. It appears incremental because it builds directly on the supervised-learning results of Part I.

The paper extends a duality between prevalence and uncertainty quantification from supervised to unsupervised learning. It introduces linearly independent populations and an isomorphism between classifiers defined on impure and pure populations, which frames unsupervised learning as a generalization of supervised learning. The methods are illustrated on synthetic data and a SARS-CoV-2 ELISA.

This is the second manuscript in a two-part series that uses diagnostic testing to understand the connection between prevalence (i.e. number of elements in a class), uncertainty quantification (UQ), and classification theory. Part I considered the context of supervised machine learning (ML) and established a duality between prevalence and the concept of relative conditional probability. The key idea of that analysis was to train a family of discriminative classifiers by minimizing a sum of prevalence-weighted empirical risk functions. The resulting outputs can be interpreted as relative probability level-sets, which thereby yield uncertainty estimates in the class labels. This procedure also demonstrated that certain discriminative and generative ML models are equivalent. Part II considers the extent to which these results can be extended to tasks in unsupervised learning through recourse to ideas in linear algebra. We first observe that the distribution of an impure population, for which the class of a corresponding sample is unknown, can be parameterized in terms of a prevalence. This motivates us to introduce the concept of linearly independent populations, which have different but unknown prevalence values. Using this, we identify an isomorphism between classifiers defined in terms of impure and pure populations. In certain cases, this also leads to a nonlinear system of equations whose solution yields the prevalence values of the linearly independent populations, fully realizing unsupervised learning as a generalization of supervised learning. We illustrate our methods in the context of synthetic data and a research-use-only SARS-CoV-2 enzyme-linked immunosorbent assay (ELISA).
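
As a rough illustration of the linear-algebra idea in the abstract (a sketch under stated assumptions, not the authors' implementation), consider two pure classes with known densities and two impure populations that mix them with different prevalences. The Gaussian class densities, the prevalence values `q1` and `q2`, and all function names below are hypothetical choices made for this example. When the prevalences differ, the 2x2 mixing matrix is invertible, so the pure densities can be recovered from the impure ones; in addition, the ratio of the impure densities is a monotone function of the ratio of the pure densities, so classifiers built on either pair share the same level sets.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def mix(prevalence, pos_pdf, neg_pdf):
    """Density of an impure population with the given prevalence of positives."""
    return prevalence * pos_pdf + (1.0 - prevalence) * neg_pdf

# Hypothetical pure-class densities on a grid of measurement values.
x = np.linspace(-3.0, 6.0, 181)
pos_pdf = gaussian_pdf(x, 3.0, 1.0)   # assumed "positive" class
neg_pdf = gaussian_pdf(x, 0.0, 1.0)   # assumed "negative" class

# Two impure populations with different (here, known) prevalences.
q1, q2 = 0.7, 0.2
impure = np.vstack([mix(q1, pos_pdf, neg_pdf), mix(q2, pos_pdf, neg_pdf)])

# Rows of the mixing matrix are linearly independent because q1 != q2;
# its determinant equals q1 - q2.
mixing = np.array([[q1, 1.0 - q1], [q2, 1.0 - q2]])

# Invert the mixing to recover the pure densities from the impure ones.
recovered_pos, recovered_neg = np.linalg.solve(mixing, impure)

# Likelihood ratios for the pure pair and the impure pair.
pure_ratio = pos_pdf / neg_pdf
impure_ratio = impure[0] / impure[1]

print("max recovery error:", np.max(np.abs(recovered_pos - pos_pdf)))
print("impure ratio increases with pure ratio:", bool(np.all(np.diff(impure_ratio) > 0)))
```

In this sketch the prevalences are given, which corresponds to the easier half of the problem. The abstract describes a further step, solving a nonlinear system for the unknown prevalence values, which is not attempted here.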
