CL CY
Aug 26, 2024

Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study

Liuchang Xu, Shuo Zhao, Qingming Lin, Luyao Chen, Qianqian Luo, Sensen Wu, Xinyue Ye, Hailin Feng, Zhenhong Du

arXiv:2408.14438v48.718 citations
h-index: 20

Originality: Synthesis-oriented

AI Analysis

The paper addresses the problem of assessing the spatial capabilities of LLMs, which is relevant to AI researchers, but its contribution is incremental because it focuses on benchmarking rather than on new methods.

The study addresses the lack of evaluation of large language models on spatial tasks by introducing a new multi-task spatial dataset and testing models such as gpt-4o, which achieved the highest zero-shot average accuracy at 71.3%. Prompt strategies had large effects on specific tasks: for example, Chain-of-Thought raised gpt-4o's accuracy on simple route planning from 12.4% to 87.5%.

The emergence of large language models such as ChatGPT, Gemini, and others highlights the importance of evaluating their diverse capabilities, ranging from natural language understanding to code generation. However, their performance on spatial tasks has not been thoroughly assessed. This study addresses this gap by introducing a new multi-task spatial evaluation dataset designed to systematically explore and compare the performance of several advanced models on spatial tasks. The dataset includes twelve distinct task types, such as spatial understanding and simple route planning, each with verified and accurate answers. We evaluated multiple models, including OpenAI's gpt-3.5-turbo, gpt-4-turbo, gpt-4o, ZhipuAI's glm-4, Anthropic's claude-3-sonnet-20240229, and MoonShot's moonshot-v1-8k, using a two-phase testing approach. First, we conducted zero-shot testing. Then, we categorized the dataset by difficulty and performed prompt-tuning tests. Results show that gpt-4o achieved the highest overall accuracy in the first phase, with an average of 71.3%. Although moonshot-v1-8k slightly underperformed overall, it outperformed gpt-4o in place name recognition tasks. The study also highlights the impact of prompt strategies on model performance in specific tasks. For instance, the Chain-of-Thought (CoT) strategy increased gpt-4o's accuracy in simple route planning from 12.4% to 87.5%, while a one-shot strategy improved moonshot-v1-8k's accuracy in mapping tasks from 10.1% to 76.3%.
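
The three prompting conditions named in the abstract (zero-shot, one-shot, and Chain-of-Thought) can be illustrated with a minimal sketch. The templates, the function names `build_prompt` and `accuracy`, and the sample question are all hypothetical and chosen for illustration; they are not the prompts or scoring code used in the study.

```python
def build_prompt(question, strategy="zero-shot", example=None):
    """Build a prompt for one strategy. Templates are illustrative assumptions."""
    if strategy == "zero-shot":
        # The question alone, with no demonstration and no reasoning cue.
        return f"Question: {question}\nAnswer:"
    if strategy == "one-shot":
        # One worked (question, answer) pair precedes the target question.
        if example is None:
            raise ValueError("one-shot prompting needs a (question, answer) example")
        ex_question, ex_answer = example
        return (
            f"Question: {ex_question}\nAnswer: {ex_answer}\n\n"
            f"Question: {question}\nAnswer:"
        )
    if strategy == "cot":
        # A common Chain-of-Thought cue that asks for intermediate reasoning.
        return f"Question: {question}\nAnswer: Let's think step by step."
    raise ValueError(f"unknown strategy: {strategy}")


def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical route-planning question, used only to show the three prompts.
question = "From A, go two blocks north, then one block east. Where are you relative to A?"
for strategy in ("zero-shot", "cot"):
    print(build_prompt(question, strategy))
print(build_prompt(question, "one-shot", example=("Is B north of A?", "Yes")))
```

Scoring each task separately with `accuracy` under each strategy would allow the kind of per-task before-and-after comparison the abstract reports, though exact-match scoring is itself an assumption here.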
