CL AI LG Feb 24, 2022

Capturing Failures of Large Language Models via Human Cognitive Biases

arXiv:2202.12299v211.4133 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for better evaluation methods for AI systems in real-world applications. Its contribution is incremental: it applies existing cognitive science methodology to a new domain.

The paper assesses the reliability of large language models on open-ended generation tasks by identifying qualitative categories of errors, using human cognitive biases such as framing effects and anchoring to hypothesize and test for failures. It finds that OpenAI's Codex errs predictably depending on how the prompt is framed, adjusts its outputs toward anchors, and favors outputs that mimic frequent training examples. The same framework is then used to elicit high-impact errors such as incorrectly deleting files.

Large language models generate complex, open-ended outputs: instead of outputting a class label they write summaries, generate dialogue, or produce working code. In order to assess the reliability of these open-ended generation systems, we aim to identify qualitative categories of erroneous behavior, beyond identifying individual errors. To hypothesize and test for such qualitative errors, we draw inspiration from human cognitive biases -- systematic patterns of deviation from rational judgement. Specifically, we use cognitive biases as motivation to (i) generate hypotheses for problems that models may have, and (ii) develop experiments that elicit these problems. Using code generation as a case study, we find that OpenAI's Codex errs predictably based on how the input prompt is framed, adjusts outputs towards anchors, and is biased towards outputs that mimic frequent training examples. We then use our framework to elicit high-impact errors such as incorrectly deleting files. Our results indicate that experimental methodology from cognitive science can help characterize how machine learning systems behave.
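As a rough illustration of what an anchoring-style experiment on a code model might look like, the sketch below compares a completion for a plain prompt against a completion for the same prompt with an unrelated "anchor" snippet prepended, and measures how many anchor lines reappear. This is not the paper's actual setup: the function names, the line-overlap metric, and the `toy_model` stand-in (which simply echoes indented prompt lines) are all assumptions made for illustration. A real experiment would substitute calls to an actual code-generation model.

```python
from typing import Callable


def build_anchored_prompt(task_prompt: str, anchor_code: str) -> str:
    """Prepend an anchor snippet (a related but wrong solution) to the task prompt."""
    return anchor_code.rstrip() + "\n\n" + task_prompt


def anchor_overlap(completion: str, anchor_code: str) -> float:
    """Fraction of non-empty anchor lines that reappear verbatim in the completion."""
    anchor_lines = [ln.strip() for ln in anchor_code.splitlines() if ln.strip()]
    if not anchor_lines:
        return 0.0
    completion_lines = {ln.strip() for ln in completion.splitlines() if ln.strip()}
    copied = sum(1 for ln in anchor_lines if ln in completion_lines)
    return copied / len(anchor_lines)


def anchoring_effect(
    generate: Callable[[str], str], task_prompt: str, anchor_code: str
) -> float:
    """Overlap with the anchor when it is present, minus overlap when it is absent."""
    baseline = generate(task_prompt)
    anchored = generate(build_anchored_prompt(task_prompt, anchor_code))
    return anchor_overlap(anchored, anchor_code) - anchor_overlap(baseline, anchor_code)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a code model: echoes indented prompt lines,
    then appends a fixed function body. Replace with a real model call."""
    echoed = [ln for ln in prompt.splitlines() if ln.startswith("    ")]
    return "\n".join(echoed + ["    return sum(xs)"])


TASK = 'def total(xs):\n    """Return the sum of xs."""'
ANCHOR = "def total_plus_one(xs):\n    result = sum(xs) + 1\n    return result"

effect = anchoring_effect(toy_model, TASK, ANCHOR)
print(f"anchoring effect: {effect:.2f}")
```

A positive value means the completion copied more of the anchor when the anchor was in the prompt. Verbatim line overlap is a deliberately crude measure; checking functional correctness of the generated code would be a more informative outcome, at the cost of needing test cases per task.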
