Kimberly Glasgow

HCJan 31, 2022

Won't you see my neighbor?: User predictions, mental models, and similarity-based explanations of AI classifiers

Kimberly Glasgow, Jonathan Kopecky, John Gersh et al.

Humans should be able work more effectively with artificial intelligence-based systems when they can predict likely failures and form useful mental models of how the systems work. We conducted a study of human's mental models of artificial intelligence systems using a high-performing image classifier, focusing on participants' ability to predict the classification result for a particular image. Participants viewed individual labeled images in one of two classes and then tried to predict whether the classifier would label them correctly. In this experiment we explored the effect of giving participants additional information about an image's nearest neighbors in a space representing the otherwise uninterpretable features extracted by the lower layers of the classifier's neural network. We found that providing this information did increase participants' prediction performance, and that the performance improvement could be related to the neighbor images' similarity to the target image. We also found indications that the presentation of this information may influence people's own classification of the target image -- that is, rather than just anthropomorphizing the system, in some cases the humans become "mechanomorphized" in their judgements.

CLMar 23, 2016

Evaluating semantic models with word-sentence relatedness

Kimberly Glasgow, Matthew Roos, Amy Haufler et al.

Semantic textual similarity (STS) systems are designed to encode and evaluate the semantic similarity between words, phrases, sentences, and documents. One method for assessing the quality or authenticity of semantic information encoded in these systems is by comparison with human judgments. A data set for evaluating semantic models was developed consisting of 775 English word-sentence pairs, each annotated for semantic relatedness by human raters engaged in a Maximum Difference Scaling (MDS) task, as well as a faster alternative task. As a sample application of this relatedness data, behavior-based relatedness was compared to the relatedness computed via four off-the-shelf STS models: n-gram, Latent Semantic Analysis (LSA), Word2Vec, and UMBC Ebiquity. Some STS models captured much of the variance in the human judgments collected, but they were not sensitive to the implicatures and entailments that were processed and considered by the participants. All text stimuli and judgment data have been made freely available.

Kimberly Glasgow

2 Papers