CV, AI | Oct 27, 2025

PISA-Bench: The PISA Index as a Multilingual and Multimodal Metric for the Evaluation of Vision-Language Models

Patrick Haller, Fabio Barth, Jonas Golde, Georg Rehm, Alan Akbik

arXiv:2510.24792v2 | 1 citation | h-index: 6

Originality: Synthesis-oriented

AI Analysis

PISA-Bench provides a resource for advancing multilingual multimodal reasoning research, though the contribution is incremental: it addresses a known bottleneck in evaluation rather than proposing new methods.

The authors tackle the lack of high-quality, multilingual benchmarks for vision-language models by introducing PISA-Bench, a dataset derived from the expert-created PISA tests and translated from English into five additional languages, for six parallel languages in total. They find that small models (<20B parameters) perform poorly, that performance degrades substantially on the non-English splits, and that error rates are high on spatial and geometric reasoning.

Vision-language models (VLMs) have demonstrated remarkable progress in multimodal reasoning. However, existing benchmarks remain limited in terms of high-quality, human-verified examples. Many current datasets rely on content synthetically generated by large language models (LLMs). Furthermore, most datasets are limited to English, as manual quality assurance of translated samples is time-consuming and costly. To fill this gap, we introduce PISA-Bench, a multilingual benchmark derived from English examples of the expert-created PISA tests, a unified framework for the assessment of student competencies in over eighty countries. Each example consists of human-extracted instructions, questions, answer options, and images, enriched with question type categories, and has been translated from English into five additional languages (Spanish, German, Chinese, French, and Italian), resulting in a fully parallel corpus covering six languages. We evaluate state-of-the-art vision-language models on PISA-Bench and find that small models in particular (<20B parameters) fail to achieve high test scores. We further find substantial performance degradation on non-English splits as well as high error rates when models are tasked with spatial and geometric reasoning. By releasing the dataset and evaluation framework, we provide a resource for advancing research on multilingual multimodal reasoning.
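
Because the corpus is fully parallel, the degradation on non-English splits can be read as a per-language accuracy gap against the English split over the same questions. The sketch below illustrates that computation only; it is not the authors' evaluation framework. The `Example` fields, the language codes, and both function names are hypothetical, and the released dataset may use a different schema and scoring procedure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    # Hypothetical schema: the released dataset may name or structure these differently.
    example_id: str     # shared across the parallel translations of one question
    language: str       # assumed codes such as "en", "de", "es", "fr", "it", "zh"
    question_type: str  # question type category attached to each example
    answer: str         # gold answer label


def accuracy_by_language(examples, predictions):
    """Return {language: accuracy}.

    `predictions` maps (example_id, language) to the predicted answer label.
    A missing prediction counts as an incorrect answer.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex.language] += 1
        if predictions.get((ex.example_id, ex.language)) == ex.answer:
            correct[ex.language] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def degradation_vs_reference(accuracy, reference="en"):
    """Accuracy drop of every other language relative to the reference split."""
    return {
        lang: accuracy[reference] - acc
        for lang, acc in accuracy.items()
        if lang != reference
    }
```

Grouping on `question_type` instead of `language` in the same loop would give the per-category breakdown needed to inspect spatial and geometric reasoning errors.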

