CL AI Aug 29, 2025

BLUEX Revisited: Enhancing Benchmark Coverage with Automatic Captioning

João Guilherme Alves Santos, Giovana Kerche Bonás, Thales Sales Almeida

arXiv:2508.21294v1 | 6.72 citations | h-index: 4 | Has Code | Anais do XXII Encontro Nacional de Inteligência Artificial e Computacional (ENIAC 2025)

Originality: Synthesis-oriented

AI Analysis

This work supports data contamination studies in LLM pretraining, particularly for multilingual and non-English evaluation, but it is incremental because it builds on an existing dataset.

The authors address the need for robust LLM evaluation in multilingual contexts by updating the BLUEX dataset with 2024-2025 exams and automatically generated image captions. Captioning increases accessibility for text-only models by more than 40% and yields 1,422 usable questions, more than double the number in the original BLUEX.

With the growing capabilities of Large Language Models (LLMs), there is an increasing need for robust evaluation methods, especially in multilingual and non-English contexts. We present an updated version of the BLUEX dataset, now including 2024-2025 exams and automatically generated image captions using state-of-the-art models, enhancing its relevance for data contamination studies in LLM pretraining. Captioning strategies increase accessibility to text-only models by more than 40%, producing 1,422 usable questions, more than doubling the number in the original BLUEX. We evaluated commercial and open-source LLMs and their ability to leverage visual context through captions.
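The captioning idea in the abstract can be illustrated with a minimal sketch: a question that depends on images becomes usable by a text-only model once every image has a caption that can be placed in the prompt. The `Question` schema, the function names, and the prompt format below are assumptions made for illustration; they are not taken from the BLUEX code or data files.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Question:
    # Hypothetical schema; the real BLUEX field names may differ.
    text: str
    # One entry per image in the question; None means no caption is available.
    captions: List[Optional[str]] = field(default_factory=list)


def is_text_only_usable(q: Question) -> bool:
    """Usable by a text-only model if every image has a caption (or there are no images)."""
    return all(c is not None for c in q.captions)


def render_prompt(q: Question) -> str:
    """Append each caption to the question text as a plain-text image description."""
    lines = [q.text]
    for i, caption in enumerate(q.captions, start=1):
        lines.append(f"[Image {i} description: {caption}]")
    return "\n".join(lines)


def relative_increase(before: int, after: int) -> float:
    """Relative growth of the usable pool as a fraction (0.4 means +40%)."""
    return (after - before) / before


if __name__ == "__main__":
    # Invented toy questions, not BLUEX data.
    pool = [
        Question("Which option completes the sentence?"),
        Question("What trend does the chart show?", ["A line chart rising from 2010 to 2020."]),
        Question("Which structure is labeled X?", [None]),
    ]
    usable = [q for q in pool if is_text_only_usable(q)]
    print(len(usable), "of", len(pool), "toy questions are usable by a text-only model")
    print(render_prompt(pool[1]))
```

Under this framing, the reported gain of more than 40% would correspond to `relative_increase` applied to the usable-question counts without and with captions, although the paper may define the metric differently.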
