CL | AI | Nov 27, 2023

WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models

arXiv:2311.15930v1 | 16 citations | h-index: 28
Originality: Synthesis-oriented
AI Analysis

This work addresses a problem relevant to AI researchers: evaluating grounded reasoning in LLMs. Its contribution is incremental, since it builds on existing benchmarking efforts without introducing a new method.

The authors assess whether large language models (LLMs) can sustain consistent world models by creating WorldSense, a synthetic benchmark that tests simple inferences from arrangements of entities. They find that state-of-the-art models (GPT-3.5, GPT-4, and Llama2-chat) make errors with as few as three objects and show heavy response biases. Errors persist despite prompting techniques such as chain-of-thought and in-context learning, and fine-tuned models generalise only to a limited extent.

We propose WorldSense, a benchmark designed to assess the extent to which LLMs are consistently able to sustain tacit world models, by testing how they draw simple inferences from descriptions of simple arrangements of entities. WorldSense is a synthetic benchmark with three problem types, each with its own trivial control, which explicitly avoids bias by decorrelating the abstract structure of problems from the vocabulary and expressions, and by decorrelating all problem subparts with the correct response. We run our benchmark on three state-of-the-art chat-LLMs (GPT3.5, GPT4 and Llama2-chat) and show that these models make errors even with as few as three objects. Furthermore, they have quite heavy response biases, preferring certain responses irrespective of the question. Errors persist even with chain-of-thought prompting and in-context learning. Lastly, we show that while finetuning on similar problems does result in substantial improvements (both within- and out-of-distribution), the finetuned models do not generalise beyond a constrained problem space.
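To make the idea of a synthetic, answer-balanced problem concrete, here is a minimal Python sketch. It is not the authors' generator: the vocabulary `NAMES`, the "to the left of" phrasing, and the helpers `make_problem`, `ground_truth`, and `response_bias` are all hypothetical, and the sketch covers only one toy ordering problem with three entities. It picks the label before building the question, so TRUE and FALSE are equally likely regardless of which names appear, and it shows one simple way to quantify a preference for a particular response.

```python
import random
from collections import Counter

# Hypothetical vocabulary; the real benchmark's wording is not reproduced here.
NAMES = ["Alice", "Bob", "Carol", "Dan", "Eve"]

def make_problem(rng):
    """Build one toy three-entity ordering problem with a TRUE/FALSE answer."""
    a, b, c = rng.sample(NAMES, 3)        # hidden left-to-right order
    label = rng.choice([True, False])     # choose the answer first to keep labels balanced
    x, y = (a, c) if label else (c, a)    # build the query so that it matches the label
    text = (f"{a} is to the left of {b}. {b} is to the left of {c}. "
            f"Is {x} to the left of {y}? Answer TRUE or FALSE.")
    return {"text": text, "order": [a, b, c], "query": (x, y), "answer": label}

def ground_truth(problem):
    """Recompute the answer from the hidden order, independently of the label."""
    order = problem["order"]
    x, y = problem["query"]
    return order.index(x) < order.index(y)

def response_bias(responses):
    """Fraction of TRUE responses minus 0.5; zero means no preference."""
    counts = Counter(responses)
    return counts[True] / len(responses) - 0.5

rng = random.Random(0)
problems = [make_problem(rng) for _ in range(1000)]
```

On a balanced set like this, a responder that always answers TRUE scores close to chance accuracy while showing the maximum bias of 0.5, which is why accuracy and bias are worth reporting separately.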

Code Implementations: 1 repo