Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks
This work addresses the problem of evaluating the reasoning capabilities of AI systems, a question relevant to researchers and developers. It is incremental: it extends prior research with more detailed prompting and multimodal testing.
The study compared the abstract reasoning abilities of GPT-4 and GPT-4V with humans using the ConceptARC benchmark, finding that neither AI model achieved robust, human-like abstraction levels.
We explore the abstract reasoning abilities of text-only and multimodal versions of GPT-4, using the ConceptARC benchmark [10], which is designed to evaluate robust understanding and reasoning with core-knowledge concepts. We extend the work of Moskvichev et al. [10] in two ways: by evaluating GPT-4 with more detailed, one-shot prompts (rather than simple, zero-shot prompts) on text versions of ConceptARC tasks, and by evaluating GPT-4V, the multimodal version of GPT-4, with zero- and one-shot prompts on image versions of the simplest tasks. Our experimental results support the conclusion that neither version of GPT-4 has developed robust abstraction abilities at humanlike levels.
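To make the distinction between zero-shot and one-shot prompting on text versions of grid tasks concrete, the following is a minimal sketch, not the prompts used in this work. It assumes that a task consists of demonstration input/output grids plus a test input, that grids are encoded as rows of integers, and that a one-shot prompt simply prepends one fully solved task. The names grid_to_text, format_task, and build_prompt, and the wording of the instructions, are hypothetical.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]
Demo = Tuple[Grid, Grid]  # (input grid, output grid)

# Hypothetical instruction text; the wording used in the study may differ.
INSTRUCTIONS = (
    "Each task shows input grids and their output grids. "
    "Infer the rule and give the output grid for the test input."
)


def grid_to_text(grid: Grid) -> str:
    """Serialize a grid as one row per line, cells separated by spaces."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def format_task(demos: List[Demo], test_input: Grid) -> str:
    """Render the demonstrations and the test input of a single task."""
    parts = []
    for i, (inp, out) in enumerate(demos, start=1):
        parts.append(f"Demonstration {i} input:\n{grid_to_text(inp)}")
        parts.append(f"Demonstration {i} output:\n{grid_to_text(out)}")
    parts.append(f"Test input:\n{grid_to_text(test_input)}")
    return "\n\n".join(parts)


def build_prompt(
    demos: List[Demo],
    test_input: Grid,
    solved_example: Optional[Tuple[List[Demo], Grid, Grid]] = None,
) -> str:
    """Build a zero-shot prompt, or a one-shot prompt if a solved task is given."""
    parts = [INSTRUCTIONS]
    if solved_example is not None:
        ex_demos, ex_input, ex_output = solved_example
        # One-shot: show a complete task together with its solution first.
        parts.append("Example task:")
        parts.append(format_task(ex_demos, ex_input))
        parts.append("Test output:\n" + grid_to_text(ex_output))
        parts.append("Now solve this task:")
    parts.append(format_task(demos, test_input))
    parts.append("Test output:")
    return "\n\n".join(parts)


# Toy task (invented for illustration): swap the two columns of the grid.
demos = [([[1, 0], [1, 0]], [[0, 1], [0, 1]])]
test_input = [[2, 0], [0, 0]]
example = ([([[3, 0]], [[0, 3]])], [[0, 4]], [[4, 0]])

zero_shot = build_prompt(demos, test_input)
one_shot = build_prompt(demos, test_input, solved_example=example)
```

The resulting strings would be sent to the model as the user message. The only difference between the two conditions in this sketch is the solved example placed before the target task.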