Alan F. Karr

5 papers

10 citations

Novelty 51%

AI Score 37

Ranked #115,453 of 205,806 authors (top 56%)
#1,520 in ML (top 43%)

5 Papers

LG Feb 25

Effects of Training Data Quality on Classifier Performance

Alan F. Karr, Regina Ruane

We describe extensive numerical experiments assessing and quantifying how classifier performance depends on the quality of the training data, a frequently neglected component of the analysis of classifiers. More specifically, in the scientific context of metagenomic assembly of short DNA reads into "contigs," we examine the effects of degrading the quality of the training data by multiple mechanisms, and for four classifiers: Bayes classifiers, neural nets, partition models, and random forests. We investigate both individual behavior and congruence among the classifiers. We find breakdown-like behavior that holds for all four classifiers: as degradation increases, they move from being mostly correct to only coincidentally correct, because they are wrong in the same way. In the process, a picture of spatial heterogeneity emerges: as the training data move farther from the analysis data, classifier decisions degenerate, the boundary becomes less dense, and congruence increases.

ML Dec 8, 2022

Structure of Classifier Boundaries: Case Study for a Naive Bayes Classifier

Alan F. Karr, Zac Bowen, Adam A. Porter et al.

Classifiers assign complex input data points to one of a small number of output categories. For a Bayes classifier whose input space is a graph, we study the structure of the boundary, which comprises those points for which at least one neighbor is classified differently. The scientific setting is assignment of DNA reads produced by next-generation sequencers (NGSs) to candidate source genomes. The boundary is both large and complicated in structure. We introduce a new measure of uncertainty, Neighbor Similarity, that compares the result for an input point to the distribution of results for its neighbors. This measure not only tracks two inherent uncertainty measures for the Bayes classifier, but also can be implemented for classifiers without inherent measures of uncertainty.
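
The boundary definition in this abstract can be illustrated with a small sketch. It assumes the input space is given as an adjacency dictionary and that each point already has a predicted label. The function names `boundary` and `neighbor_agreement` are hypothetical, and the agreement fraction is only a simple stand-in for a neighbor-based uncertainty measure. The abstract does not give the paper's exact definition of Neighbor Similarity, so this is not that definition.

```python
def boundary(graph, labels):
    """Return the points that have at least one differently classified neighbor.

    graph  : dict mapping each point to an iterable of its neighbors (assumed format)
    labels : dict mapping each point to its predicted category
    """
    return {
        v for v, nbrs in graph.items()
        if any(labels[u] != labels[v] for u in nbrs)
    }


def neighbor_agreement(graph, labels, v):
    """Fraction of v's neighbors that share v's label.

    Hypothetical stand-in for a neighbor-based uncertainty measure;
    a point with no neighbors is treated as fully agreeing.
    """
    nbrs = list(graph[v])
    if not nbrs:
        return 1.0
    same = sum(1 for u in nbrs if labels[u] == labels[v])
    return same / len(nbrs)


# Toy path graph a - b - c - d with two categories.
toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
toy_labels = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
toy_boundary = boundary(toy_graph, toy_labels)
```

In the toy graph, only the two middle points touch a differently labeled neighbor, so they form the boundary, and each has an agreement fraction of one half.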

GN Dec 24, 2021

Application of Markov Structure of Genomes to Outlier Identification and Read Classification

Alan F. Karr, Jason Hauzel, Adam A. Porter et al.

In this paper we apply the structure of genomes as second-order Markov processes specified by the distributions of successive triplets of bases to two bioinformatics problems: identification of outliers in genome databases and read classification in metagenomics, using real coronavirus and adenovirus data.
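
To make the triplet-based Markov structure concrete, here is a minimal sketch, not the authors' implementation. It assumes sequences over the alphabet A, C, G, T, adds a pseudocount so that unseen triplets keep nonzero probability, and classifies a read by maximum log-likelihood. All function names and the pseudocount are illustrative assumptions.

```python
import math
from collections import Counter
from itertools import product

BASES = "ACGT"  # assumption: ambiguous bases such as N are ignored


def triplet_distribution(seq, pseudocount=1.0):
    """Estimate the distribution of successive (overlapping) base triplets."""
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    triplets = ["".join(t) for t in product(BASES, repeat=3)]
    total = sum(counts[t] for t in triplets) + pseudocount * len(triplets)
    return {t: (counts[t] + pseudocount) / total for t in triplets}


def transition_probs(dist):
    """Second-order transition probabilities P(c | a, b) implied by the triplets."""
    probs = {}
    for a, b in product(BASES, repeat=2):
        pair_total = sum(dist[a + b + c] for c in BASES)
        for c in BASES:
            probs[a + b + c] = dist[a + b + c] / pair_total
    return probs


def log_likelihood(read, probs):
    """Log-likelihood of a read given its first two bases; unknown triplets are skipped."""
    return sum(
        math.log(probs[read[i:i + 3]])
        for i in range(len(read) - 2)
        if read[i:i + 3] in probs
    )


def classify(read, genomes):
    """Assign the read to the candidate genome whose model gives the highest likelihood."""
    models = {
        name: transition_probs(triplet_distribution(g))
        for name, g in genomes.items()
    }
    return max(models, key=lambda name: log_likelihood(read, models[name]))


# Synthetic toy genomes, not real viral data.
toy_genomes = {"at_rich": "AT" * 50, "gc_rich": "GC" * 50}
toy_call = classify("ATATATAT", toy_genomes)
```

The same per-genome likelihoods could in principle also flag outliers, for example genomes that fit poorly under models built from their database neighbors, though the abstract does not specify how the authors do this.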

ML Dec 24, 2021

Measuring Quality of DNA Sequence Data via Degradation

Alan F. Karr, Jason Hauzel, Adam A. Porter et al.

We propose and apply a novel paradigm for characterization of genome data quality, which quantifies the effects of intentional degradation of quality. The rationale is that the higher the initial quality, the more fragile the genome and the greater the effects of degradation. We demonstrate that this phenomenon is ubiquitous, and that quantified measures of degradation can be used for multiple purposes. We focus on identifying outliers that may be problematic with respect to data quality, but might also be true anomalies or even attempts to subvert the database.
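
The degradation idea can be sketched as follows, under loudly stated assumptions. The abstract does not name the degradation mechanisms or the measures used, so random base substitution and the total variation distance between triplet frequencies are illustrative choices only. The function names are hypothetical.

```python
import random
from collections import Counter

BASES = "ACGT"


def degrade(seq, rate, rng):
    """Replace each base, with probability `rate`, by a different base (assumed mechanism)."""
    out = []
    for b in seq:
        if rng.random() < rate:
            out.append(rng.choice([x for x in BASES if x != b]))
        else:
            out.append(b)
    return "".join(out)


def triplet_freqs(seq):
    """Relative frequencies of overlapping triplets; assumes len(seq) >= 3."""
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def degradation_effect(seq, rate, rng):
    """How far the triplet frequencies move after degrading at the given rate."""
    return total_variation(triplet_freqs(seq), triplet_freqs(degrade(seq, rate, rng)))


# Synthetic toy sequence and a seeded generator for reproducibility.
rng = random.Random(0)
toy_seq = "ACGT" * 50
effect = degradation_effect(toy_seq, 0.1, rng)
```

Computing such an effect for each genome in a database over a range of rates would give a per-genome profile in which unusually small or large values could be inspected as potential outliers.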

QM Sep 13, 2021

Specified Certainty Classification, with Application to Read Classification for Reference-Guided Metagenomic Assembly

Alan F. Karr, Jason Hauzel, Prahlad Menon et al.

Specified Certainty Classification (SCC) is a new paradigm for employing classifiers whose outputs carry uncertainties, typically in the form of Bayesian posterior probabilities. By allowing the classifier output to be less precise than one of a set of atomic decisions, SCC allows all decisions to achieve a specified level of certainty and provides insights into classifier behavior by examining all decisions that are possible. Our primary illustration is read classification for reference-guided genome assembly, but we demonstrate the breadth of SCC by also analyzing COVID-19 vaccination data.
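
One simple way to trade precision for certainty is sketched below. It assumes the classifier returns a posterior distribution over atomic categories and returns the smallest set of top-ranked categories whose combined posterior reaches the requested level. This is an illustrative construction with a hypothetical function name, not necessarily how SCC builds its less precise decisions.

```python
def certain_decision(posterior, certainty):
    """Return (labels, mass): the smallest top-ranked label set reaching `certainty`.

    posterior : dict mapping each atomic category to its posterior probability
    certainty : required level in (0, 1]
    """
    if not 0.0 < certainty <= 1.0:
        raise ValueError("certainty must be in (0, 1]")
    ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    chosen, mass = [], 0.0
    for label, p in ranked:
        chosen.append(label)
        mass += p
        if mass >= certainty:
            break  # enough posterior mass accumulated; stop widening the decision
    return set(chosen), mass


# Toy posterior over three candidate genomes (made-up numbers).
toy_posterior = {"genome_A": 0.6, "genome_B": 0.3, "genome_C": 0.1}
labels, mass = certain_decision(toy_posterior, 0.8)
```

With the toy posterior, a certainty level of 0.5 is met by a single atomic decision, while 0.8 requires the coarser decision "genome_A or genome_B".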