Alejandro Benito-Santos

h-index: 8

Papers: 4

Citations: 21

Novelty: 15%

AI Score: 23

Ranked #175,993 of 194,257 authors (top 91%); #29,130 in CL (top 95%)

4 Papers

11.5 · CL · Sep 19, 2024 · Code

Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination

Eva Sánchez Salido, Roser Morante, Julio Gonzalo et al.

In this article we present UNED-ACCESS 2024, a bilingual dataset that consists of 1003 multiple-choice questions from university entrance level exams in Spanish and English. Questions were originally formulated in Spanish and translated manually into English, and have never been publicly released. A selection of current open-source and proprietary models is evaluated in a uniform zero-shot experimental setting both on the UNED-ACCESS 2024 dataset and on an equivalent subset of MMLU questions. Results show that (i) reasoning questions are challenging for models, (ii) smaller models perform worse than larger models and degrade faster in Spanish than in English, and (iii) the performance gap between languages is negligible for the best models and grows up to 37% for smaller models. Model ranking on UNED-ACCESS 2024 is almost identical in English and Spanish, and also correlates highly (0.98 Pearson) with ranking on MMLU, suggesting that a small dataset is sufficiently diverse and representative to measure performance by discipline.

0.5 · CL · Jun 2, 2023 · Code

LyricSIM: A novel Dataset and Benchmark for Similarity Detection in Spanish Song LyricS

Alejandro Benito-Santos, Adrián Ghajari, Pedro Hernández et al.

In this paper, we present a new dataset and benchmark tailored to the task of semantic similarity in song lyrics. Our dataset, originally consisting of 2775 pairs of Spanish songs, was annotated in a collective annotation experiment by 63 native annotators. After collecting and refining the data to ensure a high degree of consensus and data integrity, we obtained 676 high-quality annotated pairs that were used to evaluate the performance of various state-of-the-art monolingual and multilingual language models. Consequently, we established baseline results that we hope will be useful to the community in all future academic and industrial applications conducted in this context.

3.3 · HC · Sep 4, 2020

Pilaster: A Collection of Citation Metadata Extracted From Publications on Visualization for the Digital Humanities

Alejandro Benito-Santos, Roberto Therón

In this paper, we present Pilaster (https://visusal.github.io/pilaster/), a collection of citation metadata extracted from publications in visualization for the digital humanities. The collection is generated from a seed set of relevant publications from which we extracted cited works, including journal and conference papers, books, theses, or blog posts, among other resources. This work has three main aims. First, the collection may serve as an entry point to the discipline for digital humanists and visualization scholars without previous experience in the field. Second, Pilaster can be regarded as a meeting point for more established visualization or humanities scholars seeking to collaborate in the development of novel research ideas and related visualization design studies in the context of the humanities. Third, given the large number of visualization design spaces that were captured, we believe the dataset has the potential to become the starting point for future studies aimed at understanding the particularities of problem-driven visualization research in this and other contexts.

5.8 · HC · Sep 4, 2020

GlassViz: Visualizing Automatically-Extracted Entry Points for Exploring Scientific Corpora in Problem-Driven Visualization Research

Alejandro Benito-Santos, Roberto Therón

In this paper, we report the development of a model and a proof-of-concept visual text analytics (VTA) tool to enhance document discovery in a problem-driven visualization research (PDVR) context. The proposed model captures the cognitive model followed by domain and visualization experts by analyzing the interdisciplinary communication channel as represented by keywords found in two disjoint collections of research papers. High distributional inter-collection similarities are employed to build informative keyword associations that serve as entry points to drive the exploration of a large document corpus. Our approach is demonstrated in the context of research on visualization for the digital humanities.