CL AI SE | Feb 17, 2025

Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities

Hanbin Wang, Xiaoxuan Zhou, Zhipeng Xu, Keyuan Cheng, Yuxin Zuo, Kai Tian, Jingwei Song, Junting Lu, Wenhui Hu, Xueyang Liu

arXiv:2502.11829v1 | 15.59 citations | h-index: 5 | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the need for better evaluation of multimodal LLMs on coding tasks, but the advance is incremental, since it builds on existing benchmarks such as MMCode and MathVista.

The paper evaluates the logical understanding and code generation of multimodal large language models by introducing Code-Vision, a benchmark in which the model must write a program from a flowchart. It finds that the proprietary model GPT-4o achieves 79.3% pass@1 on Hard problems, while the best open-source model reaches only 15%.

This paper introduces Code-Vision, a benchmark designed to evaluate the logical understanding and code generation capabilities of Multimodal Large Language Models (MLLMs). It challenges MLLMs to generate a correct program that fulfills specific functionality requirements based on a given flowchart, which visually represents the desired algorithm or process. Code-Vision comprises three subsets, HumanEval-V, Algorithm, and MATH, which evaluate MLLMs' coding abilities across basic programming, algorithmic, and mathematical problem-solving domains. We evaluate 12 MLLMs on Code-Vision. The results show a large performance gap between proprietary and open-source models: on Hard problems, GPT-4o achieves 79.3% pass@1, whereas the best open-source model achieves only 15%. Further experiments reveal that Code-Vision poses unique challenges compared to other multimodal reasoning benchmarks such as MMCode and MathVista. We also explore the reasons for the poor performance of the open-source models. All data and code are available at https://github.com/wanghanbinpanda/CodeVision.
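The abstract reports results as pass@1 but this page does not describe how the metric is computed. As a rough illustration only, the sketch below uses the widely used unbiased pass@k estimator from the original HumanEval work; whether Code-Vision samples one or several programs per problem is an assumption here, and the function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical, not taken from the Code-Vision repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of programs sampled for the problem
    c: number of those programs that pass all test cases
    k: sample budget being scored (k = 1 for pass@1)

    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    Assumption: this is not confirmed to be Code-Vision's exact procedure.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every size-k draw contains a passing one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs, one per problem."""
    if not results:
        raise ValueError("results must contain at least one problem")
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)
```

With a single sample per problem (n = 1, k = 1), the estimate reduces to the fraction of problems whose generated program passes every test, which is how a figure such as 79.3% pass@1 is usually read.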
