CL | Jul 11, 2023

BLUEX: A benchmark based on Brazilian Leading Universities Entrance eXams

arXiv:2307.05410v1 | 17 citations | h-index: 32 | Has Code
Originality: Synthesis-oriented
AI Analysis

BLUEX provides a valuable resource for advancing NLP in Portuguese, though the contribution is incremental: it adapts existing evaluation methods to a new language context.

The authors tackled the lack of high-quality Portuguese datasets for evaluating language models by introducing BLUEX, a benchmark based on Brazilian university entrance exams, and demonstrated its potential through experiments with state-of-the-art models.

One common trend in recent studies of language models (LMs) is the use of standardized tests for evaluation. However, although Portuguese is the fifth most spoken language worldwide, few such evaluations have been conducted in it, mainly because the community lacks high-quality datasets for this purpose. To address this gap, we introduce the Brazilian Leading Universities Entrance eXams (BLUEX), a dataset of entrance exams from the two leading universities in Brazil: UNICAMP and USP. The dataset includes annotated metadata for evaluating the performance of NLP models on a variety of subjects. Furthermore, BLUEX includes a collection of recently administered exams that are unlikely to be included in the training data of many popular LMs as of 2023. The dataset is also annotated to indicate the position of images in each question, providing a valuable resource for advancing the state-of-the-art in multimodal language understanding and reasoning. We describe the creation and characteristics of BLUEX and establish a benchmark through experiments with state-of-the-art LMs, demonstrating its potential for advancing natural language understanding and reasoning in Portuguese. The data and relevant code can be found at https://github.com/Portuguese-Benchmark-Datasets/BLUEX
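
The following is a minimal sketch of how the annotations described in the abstract (subject tags, exam year, image presence) might be used to select questions and score a model per subject. The `Question` record, its field names, and the `select` and `accuracy_by_subject` helpers are hypothetical and introduced only for illustration; the actual BLUEX schema and evaluation code in the repository may differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Question:
    # Hypothetical record layout; the real BLUEX field names may differ.
    text: str
    alternatives: list[str]
    answer: str            # gold alternative letter, e.g. "B"
    subjects: list[str]    # annotated subject tags
    year: int              # year the exam was administered
    has_images: bool = False


def select(questions, min_year=None, text_only=True):
    """Keep questions from recent exams and, optionally, only those without images."""
    kept = []
    for q in questions:
        if min_year is not None and q.year < min_year:
            continue
        if text_only and q.has_images:
            continue
        kept.append(q)
    return kept


def accuracy_by_subject(questions, predict):
    """Return {subject: fraction correct}; `predict` maps a Question to a letter."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        hit = predict(q) == q.answer
        # A question tagged with several subjects counts toward each of them.
        for subject in q.subjects:
            total[subject] += 1
            correct[subject] += int(hit)
    return {subject: correct[subject] / total[subject] for subject in total}
```

Restricting to recent exams (`min_year`) mirrors the abstract's point that newly administered questions are less likely to appear in an LM's training data, and `text_only` separates questions a text-only model can reasonably attempt from those that depend on images.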

Code Implementations: 1 repo
