CLSep 19, 2024Code
Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal ContaminationEva Sánchez Salido, Roser Morante, Julio Gonzalo et al.
In this article we present UNED-ACCESS 2024, a bilingual dataset that consists of 1003 multiple-choice questions of university entrance level exams in Spanish and English. Questions are originally formulated in Spanish and translated manually into English, and have not ever been publicly released. A selection of current open-source and proprietary models are evaluated in a uniform zero-shot experimental setting both on the UNED-ACCESS 2024 dataset and on an equivalent subset of MMLU questions. Results show that (i) reasoning questions are challenging for models, (ii) smaller models perform worse than larger models and degrade faster in Spanish than in English and (iii) the performance gap between languages is negligible for the best models and grows up to 37% for smaller models. Model ranking on UNED-ACCESS 2024 is almost identical in English and Spanish, and has also a high correlation (0.98 Pearson) with ranking on MMLU, suggesting that a small dataset is sufficiently diverse and representative to measure performance by discipline.
CLJul 1, 2024
Pron vs Prompt: Can Large Language Models already Challenge a World-Class Fiction Author at Creative Text Writing?Guillermo Marco, Julio Gonzalo, Ramón del Castillo et al.
It has become routine to report research results where Large Language Models (LLMs) outperform average humans in a wide range of language-related tasks, and creative text writing is no exception. It seems natural, then, to raise the bid: Are LLMs ready to compete in creative writing skills with a top (rather than average) novelist? To provide an initial answer for this question, we have carried out a contest between Patricio Pron (an awarded novelist, considered one of the best of his generation) and GPT-4 (one of the top performing LLMs), in the spirit of AI-human duels such as DeepBlue vs Kasparov and AlphaGo vs Lee Sidol. We asked Pron and GPT-4 to provide thirty titles each, and then to write short stories for both their titles and their opponent's. Then, we prepared an evaluation rubric inspired by Boden's definition of creativity, and we collected 5,400 manual assessments provided by literature critics and scholars. The results of our experimentation indicate that LLMs are still far from challenging a top human creative writer, and that reaching such level of autonomous creative writing skills probably cannot be reached simply with larger language models.
CLSep 17, 2024
Small Language Models can Outperform Humans in Short Creative Writing: A Study Comparing SLMs with Humans and LLMsGuillermo Marco, Luz Rello, Julio Gonzalo
In this paper, we evaluate the creative fiction writing abilities of a fine-tuned small language model (SLM), BART-large, and compare its performance to human writers and two large language models (LLMs): GPT-3.5 and GPT-4o. Our evaluation consists of two experiments: (i) a human study in which 68 participants rated short stories from humans and the SLM on grammaticality, relevance, creativity, and attractiveness, and (ii) a qualitative linguistic analysis examining the textual characteristics of stories produced by each model. In the first experiment, BART-large outscored average human writers overall (2.11 vs. 1.85), a 14% relative improvement, though the slight human advantage in creativity was not statistically significant. In the second experiment, qualitative analysis showed that while GPT-4o demonstrated near-perfect coherence and used less cliche phrases, it tended to produce more predictable language, with only 3% of its synopses featuring surprising associations (compared to 15% for BART). These findings highlight how model size and fine-tuning influence the balance between creativity, fluency, and coherence in creative writing tasks, and demonstrate that smaller models can, in certain contexts, rival both humans and larger models.
CLFeb 18, 2025Code
None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation BenchmarksEva Sánchez Salido, Julio Gonzalo, Guillermo Marco
In LLM evaluations, reasoning is often distinguished from recall/memorization by performing numerical variations to math-oriented questions. Here we introduce a general variation method for multiple-choice questions that completely dissociates the correct answer from previously seen tokens or concepts, requiring LLMs to understand and reason (rather than memorizing) in order to answer correctly. Using this method, we evaluate state-of-the-art proprietary and open-source LLMs on two datasets available in English and Spanish: the public MMLU benchmark and the private UNED-Access 2024 dataset. Results show that all models experience remarkable accuracy drops under our proposed variation, with an average loss of 57% on MMLU and 50% on UNED-Access 2024, ranging from 10% to 93% across models. Notably, the most accurate model in our experimentation (OpenAI-o3-mini) is not the most robust (DeepSeek-R1-70B), suggesting that the best models in standard evaluations may not be the ones with better reasoning capabilities. Also, we see larger accuracy drops in public (vs private) datasets and questions posed in their original language (vs a manual translation), which are signs of contamination and also point to a relevant role of recall/memorization in current LLMs' answers.
CLSep 29, 2025
BOE-XSUM: Extreme Summarization in Clear Language of Spanish Legal Decrees and NotificationsAndrés Fernández García, Javier de la Rosa, Julio Gonzalo et al.
The ability to summarize long documents succinctly is increasingly important in daily life due to information overload, yet there is a notable lack of such summaries for Spanish documents in general, and in the legal domain in particular. In this work, we present BOE-XSUM, a curated dataset comprising 3,648 concise, plain-language summaries of documents sourced from Spain's ``Boletín Oficial del Estado'' (BOE), the State Official Gazette. Each entry in the dataset includes a short summary, the original text, and its document type label. We evaluate the performance of medium-sized large language models (LLMs) fine-tuned on BOE-XSUM, comparing them to general-purpose generative models in a zero-shot setting. Results show that fine-tuned models significantly outperform their non-specialized counterparts. Notably, the best-performing model -- BERTIN GPT-J 6B (32-bit precision) -- achieves a 24\% performance gain over the top zero-shot model, DeepSeek-R1 (accuracies of 41.6\% vs.\ 33.5\%).
CLJun 3, 2025
The Reader is the Metric: How Textual Features and Reader Profiles Explain Conflicting Evaluations of AI Creative WritingGuillermo Marco, Julio Gonzalo, Víctor Fresno
Recent studies comparing AI-generated and human-authored literary texts have produced conflicting results: some suggest AI already surpasses human quality, while others argue it still falls short. We start from the hypothesis that such divergences can be largely explained by genuine differences in how readers interpret and value literature, rather than by an intrinsic quality of the texts evaluated. Using five public datasets (1,471 stories, 101 annotators including critics, students, and lay readers), we (i) extract 17 reference-less textual features (e.g., coherence, emotional variance, average sentence length...); (ii) model individual reader preferences, deriving feature importance vectors that reflect their textual priorities; and (iii) analyze these vectors in a shared "preference space". Reader vectors cluster into two profiles: 'surface-focused readers' (mainly non-experts), who prioritize readability and textual richness; and 'holistic readers' (mainly experts), who value thematic development, rhetorical variety, and sentiment dynamics. Our results quantitatively explain how measurements of literary quality are a function of how text features align with each reader's preferences. These findings advocate for reader-sensitive evaluation frameworks in the field of creative text generation.