CLSep 28, 2023
How many words does ChatGPT know? The answer is ChatWordsGonzalo Martínez, Javier Conde, Pedro Reviriego et al.
The introduction of ChatGPT has put Artificial Intelligence (AI) Natural Language Processing (NLP) in the spotlight. ChatGPT adoption has been exponential with millions of users experimenting with it in a myriad of tasks and application domains with impressive results. However, ChatGPT has limitations and suffers hallucinations, for example producing answers that look plausible but they are completely wrong. Evaluating the performance of ChatGPT and similar AI tools is a complex issue that is being explored from different perspectives. In this work, we contribute to those efforts with ChatWords, an automated test system, to evaluate ChatGPT knowledge of an arbitrary set of words. ChatWords is designed to be extensible, easy to use, and adaptable to evaluate also other NLP AI tools. ChatWords is publicly available and its main goal is to facilitate research on the lexical knowledge of AI tools. The benefits of ChatWords are illustrated with two case studies: evaluating the knowledge that ChatGPT has of the Spanish lexicon (taken from the official dictionary of the "Real Academia Española") and of the words that appear in the Quixote, the well-known novel written by Miguel de Cervantes. The results show that ChatGPT is only able to recognize approximately 80% of the words in the dictionary and 90% of the words in the Quixote, in some cases with an incorrect meaning. The implications of the lexical knowledge of NLP AI tools and potential applications of ChatWords are also discussed providing directions for further work on the study of the lexical knowledge of AI tools.
LGAug 26, 2024
Adaptive Resolution Inference (ARI): Energy-Efficient Machine Learning for Internet of ThingsZiheng Wang, Pedro Reviriego, Farzad Niknia et al.
The implementation of machine learning in Internet of Things devices poses significant operational challenges due to limited energy and computation resources. In recent years, significant efforts have been made to implement simplified ML models that can achieve reasonable performance while reducing computation and energy, for example by pruning weights in neural networks, or using reduced precision for the parameters and arithmetic operations. However, this type of approach is limited by the performance of the ML implementation, i.e., by the loss for example in accuracy due to the model simplification. In this article, we present adaptive resolution inference (ARI), a novel approach that enables to evaluate new tradeoffs between energy dissipation and model performance in ML implementations. The main principle of the proposed approach is to run inferences with reduced precision (quantization) and use the margin over the decision threshold to determine if either the result is reliable, or the inference must run with the full model. The rationale is that quantization only introduces small deviations in the inference scores, such that if the scores have a sufficient margin over the decision threshold, it is unlikely that the full model would have a different result. Therefore, we can run the quantized model first, and only when the scores do not have a sufficient margin, the full model is run. This enables most inferences to run with the reduced precision model and only a small fraction requires the full model, so significantly reducing computation and energy while not affecting model performance. The proposed ARI approach is presented, analyzed in detail, and evaluated using different data sets for floating-point and stochastic computing implementations. The results show that ARI can significantly reduce the energy for inference in different configurations with savings between 40% and 85%.
20.8CLMay 26
Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS)Samer Awad, Javier Conde, Carlos Arriaga et al.
Modern Large Language Models (LLMs) are often criticized for producing repetitive and homogeneous text, despite possessing vast latent vocabularies. While previous research has focused on model knowledge and training data, we investigate the role of decoding mechanics in suppressing linguistic diversity. We introduce the Word Coverage Score (WCS), a metric that quantifies the extent to which contextually appropriate human vocabulary is mathematically pruned by standard sampling filters (e.g., Top-$p$, Top-$k$, and Min-$p$). Rather than assessing static knowledge, the WCS measures the lexical survival rate of low-frequency, high-information human words as a function of sampling parameters. By auditing open-weight models on human-authored corpus fragments, we identify which logical lexical choices are rendered unreachable by the decoder, even when they reside within the probability space. Our results provide quantitative evidence that industry-standard sampling defaults act as unintended censorship mechanisms, smoothing the unique textures of human expression into a homogenized discourse. The WCS offers a rigorous framework for optimizing the trade-off between text coherence and lexical richness, providing a diagnostic tool for preserving the diversity of human language in generative models.
CLAug 16, 2024
Using large language models to estimate features of multi-word expressions: Concreteness, valence, arousalGonzalo Martínez, Juan Diego Molero, Sandra González et al.
This study investigates the potential of large language models (LLMs) to provide accurate estimates of concreteness, valence and arousal for multi-word expressions. Unlike previous artificial intelligence (AI) methods, LLMs can capture the nuanced meanings of multi-word expressions. We systematically evaluated ChatGPT-4o's ability to predict concreteness, valence and arousal. In Study 1, ChatGPT-4o showed strong correlations with human concreteness ratings (r = .8) for multi-word expressions. In Study 2, these findings were repeated for valence and arousal ratings of individual words, matching or outperforming previous AI models. Study 3 extended the prevalence and arousal analysis to multi-word expressions and showed promising results despite the lack of large-scale human benchmarks. These findings highlight the potential of LLMs for generating valuable psycholinguistic data related to multiword expressions. To help researchers with stimulus selection, we provide datasets with AI norms of concreteness, valence and arousal for 126,397 English single words and 63,680 multi-word expressions
CLAug 14, 2023
Playing with words: Comparing the vocabulary and lexical diversity of ChatGPT and humansPedro Reviriego, Javier Conde, Elena Merino-Gómez et al.
The introduction of Artificial Intelligence (AI) generative language models such as GPT (Generative Pre-trained Transformer) and tools such as ChatGPT has triggered a revolution that can transform how text is generated. This has many implications, for example, as AI-generated text becomes a significant fraction of the text, would this have an effect on the language capabilities of readers and also on the training of newer AI tools? Would it affect the evolution of languages? Focusing on one specific aspect of the language: words; will the use of tools such as ChatGPT increase or reduce the vocabulary used or the lexical richness? This has implications for words, as those not included in AI-generated content will tend to be less and less popular and may eventually be lost. In this work, we perform an initial comparison of the vocabulary and lexical richness of ChatGPT and humans when performing the same tasks. In more detail, two datasets containing the answers to different types of questions answered by ChatGPT and humans, and a third dataset in which ChatGPT paraphrases sentences and questions are used. The analysis shows that ChatGPT tends to use fewer distinct words and lower lexical richness than humans. These results are very preliminary and additional datasets and ChatGPT configurations have to be evaluated to extract more general conclusions. Therefore, further research is needed to understand how the use of ChatGPT and more broadly generative AI tools will affect the vocabulary and lexical richness in different types of text and languages.
CLOct 23, 2023
Establishing Vocabulary Tests as a Benchmark for Evaluating Large Language ModelsGonzalo Martínez, Javier Conde, Elena Merino-Gómez et al.
Vocabulary tests, once a cornerstone of language modeling evaluation, have been largely overlooked in the current landscape of Large Language Models (LLMs) like Llama, Mistral, and GPT. While most LLM evaluation benchmarks focus on specific tasks or domain-specific knowledge, they often neglect the fundamental linguistic aspects of language understanding and production. In this paper, we advocate for the revival of vocabulary tests as a valuable tool for assessing LLM performance. We evaluate seven LLMs using two vocabulary test formats across two languages and uncover surprising gaps in their lexical knowledge. These findings shed light on the intricacies of LLM word representations, their learning mechanisms, and performance variations across models and languages. Moreover, the ability to automatically generate and perform vocabulary tests offers new opportunities to expand the approach and provide a more complete picture of LLMs' language skills.
CLSep 8, 2024
Evaluating Large Language Models with Tests of Spanish as a Foreign Language: Pass or Fail?Marina Mayor-Rocher, Nina Melero, Elena Merino-Gómez et al.
Large Language Models (LLMs) have been profusely evaluated on their ability to answer questions on many topics and their performance on different natural language understanding tasks. Those tests are usually conducted in English, but most LLM users are not native English speakers. Therefore, it is of interest to analyze how LLMs understand other languages at different levels: from paragraphs to morphems. In this paper, we evaluate the performance of state-of-the-art LLMs in TELEIA, a recently released benchmark with similar questions to those of Spanish exams for foreign students, covering topics such as reading comprehension, word formation, meaning and compositional semantics, and grammar. The results show that LLMs perform well at understanding Spanish but are still far from achieving the level of a native speaker in terms of grammatical competence.
SDSep 9, 2024
Assessing Latency in ASR Systems: A Methodological Perspective for Real-Time UseCarlos Arriaga, Alejandro Pozo, Javier Conde et al.
Automatic speech recognition (ASR) systems generate real-time transcriptions but often miss nuances that human interpreters capture. While ASR is useful in many contexts, interpreters-who already use ASR tools such as Dragon-add critical value, especially in sensitive settings such as diplomatic meetings where subtle language is key. Human interpreters not only perceive these nuances but can adjust in real time, improving accuracy, while ASR handles basic transcription tasks. However, ASR systems introduce a delay that does not align with real-time interpretation needs. The user-perceived latency of ASR systems differs from that of interpretation because it measures the time between speech and transcription delivery. To address this, we propose a new approach to measuring delay in ASR systems and validate if they are usable in live interpretation scenarios.
24.2CLMar 12
To Words and Beyond: Probing Large Language Models for Sentence-Level Psycholinguistic Norms of Memorability and Reading TimesThomas Hikaru Clark, Carlos Arriaga, Javier Conde et al.
Large Language Models (LLMs) have recently been shown to produce estimates of psycholinguistic norms, such as valence, arousal, or concreteness, for words and multiword expressions, that correlate with human judgments. These estimates are obtained by prompting an LLM, in zero-shot fashion, with a question similar to those used in human studies. Meanwhile, for other norms such as lexical decision time or age of acquisition, LLMs require supervised fine-tuning to obtain results that align with ground-truth values. In this paper, we extend this approach to the previously unstudied features of sentence memorability and reading times, which involve the relationship between multiple words in a sentence-level context. Our results show that via fine-tuning, models can provide estimates that correlate with human-derived norms and exceed the predictive power of interpretable baseline predictors, demonstrating that LLMs contain useful information about sentence-level features. At the same time, our results show very mixed zero-shot and few-shot performance, providing further evidence that care is needed when using LLM-prompting as a proxy for human cognitive measures.
CLMar 21, 2024Code
Open Conversational LLMs do not know most Spanish wordsJavier Conde, Miguel González, Nina Melero et al.
The growing interest in Large Language Models (LLMs) and in particular in conversational models with which users can interact has led to the development of a large number of open-source chat LLMs. These models are evaluated on a wide range of benchmarks to assess their capabilities in answering questions or solving problems on almost any possible topic or to test their ability to reason or interpret texts. Instead, the evaluation of the knowledge that these models have of the languages has received much less attention. For example, the words that they can recognize and use in different languages. In this paper, we evaluate the knowledge that open-source chat LLMs have of Spanish words by testing a sample of words in a reference dictionary. The results show that open-source chat LLMs produce incorrect meanings for an important fraction of the words and are not able to use most of the words correctly to write sentences with context. These results show how Spanish is left behind in the open-source LLM race and highlight the need to push for linguistic fairness in conversational LLMs ensuring that they provide similar performance across languages.
CLJul 1, 2025Code
La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin AmericaMaría Grandury, Javier Aula-Blasco, Júlia Falcão et al.
Leaderboards showcase the current capabilities and limitations of Large Language Models (LLMs). To motivate the development of LLMs that represent the linguistic and cultural diversity of the Spanish-speaking community, we present La Leaderboard, the first open-source leaderboard to evaluate generative LLMs in languages and language varieties of Spain and Latin America. La Leaderboard is a community-driven project that aims to establish an evaluation standard for everyone interested in developing LLMs for the Spanish-speaking community. This initial version combines 66 datasets in Basque, Catalan, Galician, and different Spanish varieties, showcasing the evaluation results of 50 models. To encourage community-driven development of leaderboards in other languages, we explain our methodology, including guidance on selecting the most suitable evaluation setup for each downstream task. In particular, we provide a rationale for using fewer few-shot examples than typically found in the literature, aiming to reduce environmental impact and facilitate access to reproducible results for a broader research community.
CLDec 19, 2024
Why Do Large Language Models (LLMs) Struggle to Count Letters?Tairan Fu, Raquel Ferrando, Javier Conde et al.
Large Language Models (LLMs) have achieved unprecedented performance on many complex tasks, being able, for example, to answer questions on almost any topic. However, they struggle with other simple tasks, such as counting the occurrences of letters in a word, as illustrated by the inability of many LLMs to count the number of "r" letters in "strawberry". Several works have studied this problem and linked it to the tokenization used by LLMs, to the intrinsic limitations of the attention mechanism, or to the lack of character-level training data. In this paper, we conduct an experimental study to evaluate the relations between the LLM errors when counting letters with 1) the frequency of the word and its components in the training dataset and 2) the complexity of the counting operation. We present a comprehensive analysis of the errors of LLMs when counting letter occurrences by evaluating a representative group of models over a large number of words. The results show a number of consistent trends in the models evaluated: 1) models are capable of recognizing the letters but not counting them; 2) the frequency of the word and tokens in the word does not have a significant impact on the LLM errors; 3) there is a positive correlation of letter frequency with errors, more frequent letters tend to have more counting errors, 4) the errors show a strong correlation with the number of letters or tokens in a word and 5) the strongest correlation occurs with the number of letters with counts larger than one, with most models being unable to correctly count words in which letters appear more than twice.
CLFeb 11, 2024
Beware of Words: Evaluating the Lexical Diversity of Conversational LLMs using ChatGPT as Case StudyGonzalo Martínez, José Alberto Hernández, Javier Conde et al.
The performance of conversational Large Language Models (LLMs) in general, and of ChatGPT in particular, is currently being evaluated on many different tasks, from logical reasoning or maths to answering questions on a myriad of topics. Instead, much less attention is being devoted to the study of the linguistic features of the texts generated by these LLMs. This is surprising since LLMs are models for language, and understanding how they use the language is important. Indeed, conversational LLMs are poised to have a significant impact on the evolution of languages as they may eventually dominate the creation of new text. This means that for example, if conversational LLMs do not use a word it may become less and less frequent and eventually stop being used altogether. Therefore, evaluating the linguistic features of the text they produce and how those depend on the model parameters is the first step toward understanding the potential impact of conversational LLMs on the evolution of languages. In this paper, we consider the evaluation of the lexical richness of the text generated by LLMs and how it depends on the model parameters. A methodology is presented and used to conduct a comprehensive evaluation of lexical richness using ChatGPT as a case study. The results show how lexical richness depends on the version of ChatGPT and some of its parameters, such as the presence penalty, or on the role assigned to the model. The dataset and tools used in our analysis are released under open licenses with the goal of drawing the much-needed attention to the evaluation of the linguistic features of LLM-generated text.
CLJan 16, 2025
Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are WrongTairan Fu, Javier Conde, Gonzalo Martínez et al.
One of the most widely used methods to evaluate LLMs are Multiple Choice Question (MCQ) tests. MCQ benchmarks enable the testing of LLM knowledge on almost any topic at scale as the results can be processed automatically. To help the LLM answer, a few examples called few shots can be included in the prompt. Moreover, the LLM can be asked to answer the question directly with the selected option or to first provide the reasoning and then the selected answer, which is known as chain of thought. In addition to checking whether the selected answer is correct, the evaluation can look at the LLM-estimated probability of its response as an indication of the confidence of the LLM in the response. In this paper, we study how the LLM confidence in its answer depends on whether the model has been asked to answer directly or to provide the reasoning before answering. The results of the evaluation of questions on a wide range of topics in seven different models show that LLMs are more confident in their answers when they provide reasoning before the answer. This occurs regardless of whether the selected answer is correct. Our hypothesis is that this behavior is due to the reasoning that modifies the probability of the selected answer, as the LLM predicts the answer based on the input question and the reasoning that supports the selection made. Therefore, LLM estimated probabilities seem to have intrinsic limitations that should be understood in order to use them in evaluation procedures. Interestingly, the same behavior has been observed in humans, for whom explaining an answer increases confidence in its correctness.
CLFeb 23, 2025
Speed and Conversational Large Language Models: Not All Is About Tokens per SecondJavier Conde, Miguel González, Pedro Reviriego et al.
The speed of open-weights large language models (LLMs) and its dependency on the task at hand, when run on GPUs, is studied to present a comparative analysis of the speed of the most popular open LLMs.
AIFeb 23, 2025
Understanding the Impact of Artificial Intelligence in Academic Writing: Metadata to the RescueJavier Conde, Pedro Reviriego, Joaquín Salvachúa et al.
This column advocates for including artificial intelligence (AI)-specific metadata on those academic papers that are written with the help of AI in an attempt to analyze the use of such tools for disseminating research.
15.6CVApr 19
Lost in the Vibrations: Vision Language Models Fail the Dynamic Gauges TestTairan Fu, Francisco Javier Santos-Martín, Javier Conde et al.
The digital transformation of industrial manufacturing increasingly relies on the ability of autonomous robots to interact with legacy infrastructure, particularly analog gauges. While Vision-Language Models (VLMs) have demonstrated potential in zero-shot instrument recognition, their deployment in measurement systems remains constrained by an inherent inability to accurately analyze high-frequency temporal events and needle vibrations. This paper evaluates state-of-the-art models, including GPT-5 and Gemini 3, against the strict requirements of metrology and uncertainty quantification. To facilitate this evaluation, we introduce a novel dataset comprising video sequences of various gauge types: circular, linear, and Vernier, under diverse motion speed profiles. Our findings indicate that current VLMs exhibit limited ability in interpreting needle trajectories and scale semantics, failing to provide the traceability and reliability needed for safety-critical monitoring. The results demonstrate that these models have not yet achieved the performance necessary to be classified as trustworthy synthetic instruments under existing IEEE and ISO standards.
CLMay 29, 2025
Psycholinguistic Word Features: a New Approach for the Evaluation of LLMs Alignment with HumansJavier Conde, Miguel González, María Grandury et al.
The evaluation of LLMs has so far focused primarily on how well they can perform different tasks such as reasoning, question-answering, paraphrasing, or translating. For most of these tasks, performance can be measured with objective metrics, such as the number of correct answers. However, other language features are not easily quantified. For example, arousal, concreteness, or gender associated with a given word, as well as the extent to which we experience words with senses and relate them to a specific sense. Those features have been studied for many years by psycholinguistics, conducting large-scale experiments with humans to produce ratings for thousands of words. This opens an opportunity to evaluate how well LLMs align with human ratings on these word features, taking advantage of existing studies that cover many different language features in a large number of words. In this paper, we evaluate the alignment of a representative group of LLMs with human ratings on two psycholinguistic datasets: the Glasgow and Lancaster norms. These datasets cover thirteen features over thousands of words. The results show that alignment is \textcolor{black}{generally} better in the Glasgow norms evaluated (arousal, valence, dominance, concreteness, imageability, familiarity, and gender) than on the Lancaster norms evaluated (introceptive, gustatory, olfactory, haptic, auditory, and visual). This suggests a potential limitation of current LLMs in aligning with human sensory associations for words, which may be due to their lack of embodied cognition present in humans and illustrates the usefulness of evaluating LLMs with psycholinguistic datasets.
CLFeb 23, 2025
Can ChatGPT Learn to Count Letters?Javier Conde, Gonzalo Martínez, Pedro Reviriego et al.
Large language models (LLMs) struggle on simple tasks such as counting the number of occurrences of a letter in a word. In this paper, we investigate if ChatGPT can learn to count letters and propose an efficient solution.
CRNov 16, 2024
Enhanced FIWARE-Based Architecture for Cyberphysical Systems With Tiny Machine Learning and Machine Learning Operations: A Case Study on Urban Mobility SystemsJavier Conde, Andrés Munoz-Arcentales, Álvaro Alonso et al.
The rise of AI and the Internet of Things is accelerating the digital transformation of society. Mobility computing presents specific barriers due to its real-time requirements, decentralization, and connectivity through wireless networks. New research on edge computing and tiny machine learning (tinyML) explores the execution of AI models on low-performance devices to address these issues. However, there are not many studies proposing agnostic architectures that manage the entire lifecycle of intelligent cyberphysical systems. This article extends a previous architecture based on FIWARE software components to implement the machine learning operations flow, enabling the management of the entire tinyML lifecycle in cyberphysical systems. We also provide a use case to showcase how to implement the FIWARE architecture through a complete example of a smart traffic system. We conclude that the FIWARE ecosystem constitutes a real reference option for developing tinyML and edge computing in cyberphysical systems.
AIMay 4, 2025
Real-time Spatial Retrieval Augmented Generation for Urban EnvironmentsDavid Nazareno Campo, Javier Conde, Álvaro Alonso et al.
The proliferation of Generative Artificial Ingelligence (AI), especially Large Language Models, presents transformative opportunities for urban applications through Urban Foundation Models. However, base models face limitations, as they only contain the knowledge available at the time of training, and updating them is both time-consuming and costly. Retrieval Augmented Generation (RAG) has emerged in the literature as the preferred approach for injecting contextual information into Foundation Models. It prevails over techniques such as fine-tuning, which are less effective in dynamic, real-time scenarios like those found in urban environments. However, traditional RAG architectures, based on semantic databases, knowledge graphs, structured data, or AI-powered web searches, do not fully meet the demands of urban contexts. Urban environments are complex systems characterized by large volumes of interconnected data, frequent updates, real-time processing requirements, security needs, and strong links to the physical world. This work proposes a real-time spatial RAG architecture that defines the necessary components for the effective integration of generative AI into cities, leveraging temporal and spatial filtering capabilities through linked data. The proposed architecture is implemented using FIWARE, an ecosystem of software components to develop smart city solutions and digital twins. The design and implementation are demonstrated through the use case of a tourism assistant in the city of Madrid. The use case serves to validate the correct integration of Foundation Models through the proposed RAG architecture.
CLSep 17, 2025
Adding LLMs to the psycholinguistic norming toolbox: A practical guide to getting the most out of human ratingsJavier Conde, María Grandury, Tairan Fu et al.
Word-level psycholinguistic norms lend empirical support to theories of language processing. However, obtaining such human-based measures is not always feasible or straightforward. One promising approach is to augment human norming datasets by using Large Language Models (LLMs) to predict these characteristics directly, a practice that is rapidly gaining popularity in psycholinguistics and cognitive science. However, the novelty of this approach (and the relative inscrutability of LLMs) necessitates the adoption of rigorous methodologies that guide researchers through this process, present the range of possible approaches, and clarify limitations that are not immediately apparent, but may, in some cases, render the use of LLMs impractical. In this work, we present a comprehensive methodology for estimating word characteristics with LLMs, enriched with practical advice and lessons learned from our own experience. Our approach covers both the direct use of base LLMs and the fine-tuning of models, an alternative that can yield substantial performance gains in certain scenarios. A major emphasis in the guide is the validation of LLM-generated data with human "gold standard" norms. We also present a software framework that implements our methodology and supports both commercial and open-weight models. We illustrate the proposed approach with a case study on estimating word familiarity in English. Using base models, we achieved a Spearman correlation of 0.8 with human ratings, which increased to 0.9 when employing fine-tuned models. This methodology, framework, and set of best practices aim to serve as a reference for future research on leveraging LLMs for psycholinguistic and lexical studies.
AISep 16, 2025
Stochastic Streets: A Walk Through Random LLM Address Generation in four European CitiesTairan Fu, David Campo-Nazareno, Javier Coronado-Blázquez et al.
Large Language Models (LLMs) are capable of solving complex math problems or answer difficult questions on almost any topic, but can they generate random street addresses for European cities?
LGAug 5, 2025
Energy-Efficient Stochastic Computing (SC) Neural Networks for Internet of Things Devices With Layer-Wise Adjustable Sequence Length (ASL)Ziheng Wang, Pedro Reviriego, Farzad Niknia et al.
Stochastic computing (SC) has emerged as an efficient low-power alternative for deploying neural networks (NNs) in resource-limited scenarios, such as the Internet of Things (IoT). By encoding values as serial bitstreams, SC significantly reduces energy dissipation compared to conventional floating-point (FP) designs; however, further improvement of layer-wise mixed-precision implementation for SC remains unexplored. This article introduces Adjustable Sequence Length (ASL), a novel scheme that applies mixed-precision concepts specifically to SC NNs. By introducing an operator-norm-based theoretical model, this article shows that truncation noise can cumulatively propagate through the layers by the estimated amplification factors. An extended sensitivity analysis is presented, using random forest (RF) regression to evaluate multilayer truncation effects and validate the alignment of theoretical predictions with practical network behaviors. To accommodate different application scenarios, this article proposes two truncation strategies (coarse-grained and fine-grained), which apply diverse sequence length configurations at each layer. Evaluations on a pipelined SC MLP synthesized at 32nm demonstrate that ASL can reduce energy and latency overheads by up to over 60% with negligible accuracy loss. It confirms the feasibility of the ASL scheme for IoT applications and highlights the distinct advantages of mixed-precision truncation in SC designs.
AIJul 17, 2025
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human EvaluationsCarlos Arriaga, Gonzalo Martínez, Eneko Sendin et al.
The evaluation of large language models is a complex task, in which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs have to answer multiple-choice questions of different topics. However, this method has certain limitations, being the most concerning, the poor correlation with the humans. An alternative approach, is to have humans evaluate the LLMs. This poses scalability issues as there is a large and growing number of models to evaluate making it impractical (and costly) to run traditional studies based on recruiting a number of evaluators and having them rank the responses of the models. An alternative approach is the use of public arenas, such as the popular LM arena, on which any user can freely evaluate models on any question and rank the responses of two models. The results are then elaborated into a model ranking. An increasingly important aspect of LLMs is their energy consumption and, therefore, evaluating how energy awareness influences the decisions of humans in selecting a model is of interest. In this paper, we present GEA, the Generative Energy Arena, an arena that incorporates information on the energy consumption of the model in the evaluation process. Preliminary results obtained with GEA are also presented, showing that for most questions, when users are aware of the energy consumption, they favor smaller and more energy efficient models. This suggests that for most user interactions, the extra cost and energy incurred by the more complex and top-performing models do not provide an increase in the perceived quality of the responses that justifies their use.
CLJun 23, 2025
Is There a Case for Conversation Optimized Tokenizers in Large Language Models?Raquel Ferrando, Javier Conde, Gonzalo Martínez et al.
The computational and energy costs of Large Language Models (LLMs) have increased exponentially driven by the growing model sizes and the massive adoption of LLMs by hundreds of millions of users. The unit cost of an LLM is the computation of a token. Therefore, the tokenizer plays an important role in the efficiency of a model, and they are carefully optimized to minimize the number of tokens for the text in their training corpus. One of the most popular applications of LLMs are chatbots that interact with users. A key observation is that, for those chatbots, what is important is the performance of the tokenizer in the user text input and the chatbot responses. Those are most likely different from the text in the training corpus. So, a question that immediately arises is whether there is a potential benefit in optimizing tokenizers for chatbot conversations. In this paper, this idea is explored for different tokenizers by using a publicly available corpus of chatbot conversations to redesign their vocabularies and evaluate their performance in this domain. The results show that conversation-optimized tokenizers consistently reduce the number of tokens in chatbot dialogues, which can lead to meaningful energy savings, in the range of 5% to 10% while having minimal or even slightly positive impact on tokenization efficiency for the original training corpus.
CVJun 27, 2024
Recursive InPainting (RIP): how much information is lost under recursive inferences?Javier Conde, Miguel González, Gonzalo Martínez et al.
The rapid adoption of generative artificial intelligence (AI) is accelerating content creation and modification. For example, variations of a given content, be it text or images, can be created almost instantly and at a low cost. This will soon lead to the majority of text and images being created directly by AI models or by humans assisted by AI. This poses new risks; for example, AI-generated content may be used to train newer AI models and degrade their performance, or information may be lost in the transformations made by AI which could occur when the same content is processed over and over again by AI tools. An example of AI image modifications is inpainting in which an AI model completes missing fragments of an image. The incorporation of inpainting tools into photo editing programs promotes their adoption and encourages their recursive use to modify images. Inpainting can be applied recursively, starting from an image, removing some parts, applying inpainting to reconstruct the image, revising it, and then starting the inpainting process again on the reconstructed image, etc. This paper presents an empirical evaluation of recursive inpainting when using one of the most widely used image models: Stable Diffusion. The inpainting process is applied by randomly selecting a fragment of the image, reconstructing it, selecting another fragment, and repeating the process a predefined number of iterations. The images used in the experiments are taken from a publicly available art data set and correspond to different styles and historical periods. Additionally, photographs are also evaluated as a reference. The modified images are compared with the original ones by both using quantitative metrics and performing a qualitative analysis. The results show that recursive inpainting in some cases modifies the image so that it still resembles the original one while in others leads to degeneration.
AIMar 25, 2024
Concurrent Linguistic Error Detection (CLED): a New Methodology for Error Detection in Large Language ModelsJinhua Zhu, Javier Conde, Zhen Gao et al.
The wide adoption of Large language models (LLMs) makes their dependability a pressing concern. Detection of errors is the first step to mitigating their impact on a system and thus, efficient error detection for LLMs is an important issue. In many settings, the LLM is considered as a black box with no access to the internal nodes; this prevents the use of many error detection schemes that need access to the model's internal nodes. An interesting observation is that the output of LLMs in error-free operation should be valid and normal text. Therefore, when the text is not valid or differs significantly from normal text, it is likely that there is an error. Based on this observation we propose to perform Concurrent Linguistic Error Detection (CLED); this scheme extracts some linguistic features of the text generated by the LLM and feeds them to a concurrent classifier that detects errors. Since the proposed error detection mechanism only relies on the outputs of the model, then it can be used on LLMs in which there is no access to the internal nodes. The proposed CLED scheme has been evaluated on the T5 model when used for news summarization and on the OPUS-MT model when used for translation. In both cases, the same set of linguistic features has been used for error detection to illustrate the applicability of the proposed scheme beyond a specific case. The results show that CLED can detect most of the errors at a low overhead penalty. The use of the concurrent classifier also enables a trade-off between error detection effectiveness and its associated overhead, so providing flexibility to a designer.