96.0 AI Mar 31
Computational Hermeneutics: Evaluating generative AI as a cultural technology
Cody Kommers, Ruth Ahnert, Maria Antoniak et al.
Generative AI systems are increasingly recognized as cultural technologies, yet current evaluation frameworks often treat culture as a variable to be measured rather than fundamental to the system's operation. Drawing on hermeneutic theory from the humanities, we argue that GenAI systems function as "context machines" that must inherently address three interpretive challenges: situatedness (meaning only emerges in context), plurality (multiple valid interpretations coexist), and ambiguity (interpretations naturally conflict). We present computational hermeneutics as an emerging framework offering an interpretive account of what GenAI systems do, and how they might do it better. We offer three principles for hermeneutic evaluation -- that benchmarks should be iterative, not one-off; include people, not just machines; and measure cultural context, not just model output. This perspective offers a nascent paradigm for designing and evaluating contemporary AI systems: shifting from standardized questions about accuracy to contextual ones about meaning.
LG Jul 4, 2024
Zero-failure testing of binary classifiers
Ioannis Ivrissimtzis, Matthew Houliston, Shauna Concannon et al.
We propose using performance metrics derived from zero-failure testing to assess binary classifiers. The principal characteristic of the proposed approach is the asymmetric treatment of the two types of error. In particular, we construct a test set consisting of positive and negative samples, set the operating point of the binary classifier at the lowest value that results in the correct classification of all positive samples, and use the algorithm's success rate on the negative samples as a performance measure. A property of the proposed approach that sets it apart from other commonly used testing methods is that it allows the construction of a series of tests of increasing difficulty, corresponding to a nested sequence of positive sample test sets. We illustrate the proposed method on the problem of age estimation for determining whether a subject is above a legal age threshold, a problem that exemplifies the asymmetry of the two types of error. Indeed, misclassifying an underage subject is a legal and regulatory issue, while misclassifying a person above the legal age is an efficiency issue that primarily concerns the commercial user of the age estimation system.
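The procedure described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the classifier emits a real-valued score where a higher score means "more likely positive" and a sample is labelled positive when its score is at or above the threshold; under that convention the zero-failure operating point is the minimum score among the positive samples (with the opposite score convention it would be the lowest threshold value instead, as the abstract phrases it). The function names `zero_failure_test` and `nested_tests` and the example scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def zero_failure_test(
    pos_scores: Sequence[float], neg_scores: Sequence[float]
) -> Tuple[float, float]:
    """Return (threshold, success rate on the negative samples).

    Assumed convention: predict positive when score >= threshold.
    """
    if len(pos_scores) == 0 or len(neg_scores) == 0:
        raise ValueError("both test sets must be non-empty")
    # Strictest threshold that still classifies every positive correctly.
    threshold = min(pos_scores)
    # A negative is classified correctly when it falls below the threshold.
    correct_negatives = sum(1 for s in neg_scores if s < threshold)
    return threshold, correct_negatives / len(neg_scores)


def nested_tests(
    pos_scores: Sequence[float], neg_scores: Sequence[float]
) -> List[float]:
    """Success rates for the nested positive sets pos[:1], pos[:2], ..."""
    return [
        zero_failure_test(pos_scores[:k], neg_scores)[1]
        for k in range(1, len(pos_scores) + 1)
    ]


# Made-up scores, for illustration only.
positives = [0.9, 0.8, 0.6]
negatives = [0.1, 0.5, 0.7, 0.3]

threshold, rate = zero_failure_test(positives, negatives)
rates = nested_tests(positives, negatives)
print(threshold, rate, rates)
```

Because each positive set contains the previous one, the threshold can only move in one direction as the set grows, so the success rate on the negatives never increases along the sequence. This is one way to read the "series of tests of increasing difficulty" mentioned above.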