Probing Neural Language Models for Human Tacit Assumptions
This work addresses how AI models internalize stereotypic, human-like assumptions, a question relevant to natural language processing and fairness. It is incremental in that it builds on existing psychological and modeling frameworks.
The study investigates whether neural language models capture human stereotypic tacit assumptions (STAs), using diagnostic prompts built from psychological data. It finds that models are highly effective at retrieving concepts given their associated properties, providing empirical evidence that these representations can be acquired from text corpora.
Humans carry stereotypic tacit assumptions (STAs; Prince, 1978): propositional beliefs about generic concepts. Such associations are crucial for understanding natural language. We construct a diagnostic set of word prediction prompts to evaluate whether recent neural contextualized language models trained on large text corpora capture STAs. Our prompts are based on human responses in a psychological study of conceptual associations. We find models to be profoundly effective at retrieving concepts given associated properties. Our results provide empirical evidence that stereotypic conceptual representations are captured in neural models derived from semi-supervised linguistic exposure.
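The probing setup described above can be illustrated with a minimal sketch: human-listed properties of a concept are joined into a single cloze prompt with a masked slot, a model returns a ranked list of candidate fillers, and retrieval is scored by whether the original concept appears near the top. Everything concrete here is an assumption for illustration: the prompt template, the toy property lists in `NORMS`, the stand-in predictor `toy_predict`, and the choice of accuracy@k and mean reciprocal rank as metrics are hypothetical and are not taken from the study's data or code.

```python
from typing import Callable, Dict, List, Sequence


def build_prompt(properties: Sequence[str], mask_token: str = "[MASK]") -> str:
    """Join properties into one cloze prompt (hypothetical template).

    Uses a fixed article "A" as a simplification; a real template would
    need to handle "an" and plural concepts.
    """
    if not properties:
        raise ValueError("need at least one property")
    if len(properties) == 1:
        joined = properties[0]
    else:
        joined = ", ".join(properties[:-1]) + " and " + properties[-1]
    return f"A {mask_token} {joined}."


def reciprocal_rank(ranked: List[str], target: str) -> float:
    """1/rank of the target in the model's ranked predictions, 0 if absent."""
    return 1.0 / (ranked.index(target) + 1) if target in ranked else 0.0


def evaluate(norms: Dict[str, List[str]],
             predict: Callable[[str], List[str]],
             k: int = 5) -> Dict[str, float]:
    """Score concept retrieval: accuracy@k and mean reciprocal rank."""
    hits, rr_total = 0, 0.0
    for concept, props in norms.items():
        ranked = predict(build_prompt(props))
        if concept in ranked[:k]:
            hits += 1
        rr_total += reciprocal_rank(ranked, concept)
    n = len(norms)
    return {"acc@k": hits / n, "mrr": rr_total / n}


# Hypothetical toy property lists, NOT the psychological study's data.
NORMS = {
    "bear": ["has fur", "is big", "has claws"],
    "bicycle": ["has wheels", "has pedals"],
}


def toy_predict(prompt: str) -> List[str]:
    """Stand-in for a masked language model's ranked fillers (assumed)."""
    if "fur" in prompt:
        return ["bear", "dog", "cat"]
    return ["car", "bicycle", "truck"]


if __name__ == "__main__":
    print(build_prompt(NORMS["bear"]))
    print(evaluate(NORMS, toy_predict, k=1))
```

In practice `toy_predict` would be replaced by a call to a masked language model that ranks vocabulary items for the masked position; that wiring is omitted here. Passing the predictor as a function keeps prompt construction and scoring independent of any particular model.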