Quantifying the Uncertainty of Precision Estimates for Rule based Text Classifiers
This paper provides a method for assessing uncertainty in text classification. The contribution is incremental, as it builds on existing rule-based and statistical frameworks.
The paper tackles the problem of quantifying uncertainty in precision estimates for rule-based text classifiers by treating partitions of sub-string sets as Bernoulli random variables, and demonstrates the utility of this approach on a benchmark problem.
Rule-based classifiers that use the presence and absence of key sub-strings to make classification decisions have a natural mechanism for quantifying the uncertainty of their precision. For a binary classifier, the key insight is to treat partitions of the sub-string set induced by the documents as Bernoulli random variables. The mean value of each random variable is an estimate of the classifier's precision when presented with a document inducing that partition. These means can be compared, using standard statistical tests, to a desired or expected classifier precision. A set of binary classifiers can be combined into a single, multi-label classifier by an application of the Dempster-Shafer theory of evidence. The utility of this approach is demonstrated with a benchmark problem.
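The per-partition precision estimate described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: a partition is represented as the set of key sub-strings found in a document, each observation is 1 when the classifier's decision for that document was correct and 0 otherwise, and the "standard statistical test" is taken to be a one-sided exact binomial test. The function names (`partition_key`, `precision_by_partition`, `binom_lower_tail`) are hypothetical.

```python
from collections import defaultdict
from math import comb


def partition_key(text, substrings):
    """Return the set of key sub-strings present in the document.

    Assumption: this set identifies the partition the document induces.
    """
    return frozenset(s for s in substrings if s in text)


def precision_by_partition(observations, substrings):
    """Count Bernoulli outcomes per partition.

    `observations` is an iterable of (text, is_correct) pairs, where
    is_correct records whether the classifier's decision was right.
    Returns {partition: (number_correct, number_of_documents)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for text, is_correct in observations:
        key = partition_key(text, substrings)
        counts[key][0] += int(is_correct)
        counts[key][1] += 1
    return {key: (c, n) for key, (c, n) in counts.items()}


def binom_lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p).

    Used here as a one-sided p-value for the null hypothesis that the
    partition's precision is at least p.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


if __name__ == "__main__":
    # Hypothetical key sub-strings and labeled outcomes, for illustration only.
    keys = ("refund", "invoice")
    data = [
        ("please refund my invoice", True),
        ("refund requested", True),
        ("refund denied", False),
        ("refund now", False),
        ("invoice attached", True),
    ]
    target = 0.9  # desired precision (assumed value)
    for part, (c, n) in precision_by_partition(data, keys).items():
        p_value = binom_lower_tail(c, n, target)
        print(sorted(part), f"mean={c / n:.2f}", f"p-value={p_value:.3f}")
```

A small p-value for a partition suggests that the classifier's precision on documents inducing that partition falls short of the target. With only a few documents per partition the test has little power, so the counts matter as much as the means.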