GPT-5 Model Corrected GPT-4V's Chart Reading Errors, Not Prompting
This work addresses chart-reading accuracy for users of multimodal AI systems, but it is incremental, since it only compares two specific models.
The study tackled the problem of chart-reading errors by comparing the inference accuracies of GPT-5 and GPT-4V on 107 visualization questions. It found that GPT-5 substantially improved accuracy over GPT-4V, while prompt variants had minimal effects.
We present a quantitative evaluation of how zero-shot large language models (LLMs) and prompt variants affect accuracy on chart-reading tasks. We asked LLMs to answer 107 visualization questions on difficult image instances where GPT-4V had failed to produce correct answers, and compared inference accuracies between the agentic GPT-5 and the multimodal GPT-4V. Our results show that model architecture dominates inference accuracy: GPT-5 substantially improved accuracy, while prompt variants yielded only small effects. The pre-registration of this work is available at https://osf.io/u78td/?view_only=6b075584311f48e991c39335c840ded3, and the materials are available on Google Drive at https://drive.google.com/file/d/1ll8WWZDf7cCNcfNWrLViWt8GwDNSvVrp/view.
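Both models answer the same questions, so the accuracy comparison is paired. The following is a minimal sketch. It assumes each model's answers have already been scored as correct or incorrect per question. The function names `accuracy` and `mcnemar_exact` are hypothetical. The exact McNemar test is one standard choice for paired binary outcomes, and it is not necessarily the analysis specified in the pre-registration. The toy data below is invented for illustration and is not the study's data.

```python
from math import comb


def accuracy(correct: list[bool]) -> float:
    """Fraction of questions answered correctly."""
    return sum(correct) / len(correct)


def mcnemar_exact(a: list[bool], b: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired per-question correctness.

    Only discordant questions matter: n01 = wrong under A but right under B,
    n10 = right under A but wrong under B. Under the null hypothesis each
    discordant question is equally likely to fall either way, so
    min(n01, n10) follows Binomial(n01 + n10, 0.5).
    """
    if len(a) != len(b):
        raise ValueError("paired vectors must have equal length")
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n = n01 + n10
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(n01, n10)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy data (NOT the study's results): 10 questions on which a
# baseline model was always wrong and a second model was right 8 times.
baseline = [False] * 10
candidate = [True] * 8 + [False] * 2
print(accuracy(candidate))                  # 0.8
print(mcnemar_exact(baseline, candidate))   # 0.0078125
```

The same two functions could compare prompt variants for a single model, with each variant's correctness vector taking the place of a model's vector.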