HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks
This work benchmarks the high-level visual reasoning of Large Multimodal Models in coding contexts. It gives researchers a new benchmark, though the contribution is incremental because it builds on existing evaluation frameworks.
The authors address the lack of comprehensive benchmarks for diagram interpretation and reasoning in coding tasks by introducing HumanEval-V, a human-annotated benchmark spanning six task types. Across 22 models, even top performers achieve modest results: Claude 3.5 Sonnet reaches only 36.8% pass@1, indicating significant gaps in visual reasoning.
Understanding and reasoning over diagrams is a fundamental aspect of human intelligence. While Large Multimodal Models (LMMs) have demonstrated impressive capabilities across various tasks, existing benchmarks lack comprehensive evaluation of their diagram interpretation and reasoning abilities, particularly in coding contexts. We present HumanEval-V, a rigorous benchmark of human-annotated coding tasks that spans six task types and evaluates diverse visual reasoning capabilities. Each task features carefully crafted diagrams paired with function signatures and test cases, employing novel code generation tasks to thoroughly assess models' diagram comprehension. Through extensive experiments with 22 LMMs, we find that even top-performing models achieve modest success rates, with Claude 3.5 Sonnet reaching only 36.8% pass@1, highlighting substantial room for improvement. Our analysis reveals that current LMMs struggle with spatial transformations, topological relationships, and dynamic patterns that humans find intuitive. These findings provide valuable insights for advancing LMMs' visual reasoning abilities. We have open-sourced our code and benchmark at https://github.com/HumanEval-V/HumanEval-V-Benchmark.
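For readers unfamiliar with the pass@1 metric quoted above, the following is a minimal sketch of how a pass@k score is commonly computed from test-case results. It assumes the widely used unbiased estimator, pass@k = 1 - C(n-c, k)/C(n, k), averaged over tasks; the abstract does not state which estimator or sampling setup the authors used, and the function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical, not taken from the HumanEval-V code base.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task (assumed estimator, not confirmed by the source).

    n: number of code samples generated for the task
    c: number of those samples that pass every test case
    k: sample budget being evaluated (k=1 for pass@1)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # With fewer than k failing samples, any draw of k samples contains a pass.
    if n - c < k:
        return 1.0
    # Probability that at least one of k samples drawn without replacement passes.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (n, c) pairs, one pair per task."""
    if not results:
        raise ValueError("results must contain at least one task")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Illustrative numbers only: three tasks with 10 samples each.
    demo = [(10, 3), (10, 0), (10, 10)]
    print(f"pass@1 = {benchmark_pass_at_k(demo, k=1):.3f}")
```

For k = 1 the estimator reduces to the fraction of samples that pass all test cases, so a benchmark-level pass@1 of 36.8% means that, on average across tasks, roughly one generated solution in three passes.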