CLDec 4, 2025
Challenging the Abilities of Large Language Models in Italian: a Community InitiativeMalvina Nissim, Danilo Croce, Viviana Patti et al.
The rapid progress of Large Language Models (LLMs) has transformed natural language processing and broadened its impact across research and society. Yet, systematic evaluation of these models, especially for languages beyond English, remains limited. "Challenging the Abilities of LAnguage Models in ITAlian" (CALAMITA) is a large-scale collaborative benchmarking initiative for Italian, coordinated under the Italian Association for Computational Linguistics. Unlike existing efforts that focus on leaderboards, CALAMITA foregrounds methodology: it federates more than 80 contributors from academia, industry, and the public sector to design, document, and evaluate a diverse collection of tasks, covering linguistic competence, commonsense reasoning, factual consistency, fairness, summarization, translation, and code generation. Through this process, we not only assembled a benchmark of over 20 tasks and almost 100 subtasks, but also established a centralized evaluation pipeline that supports heterogeneous datasets and metrics. We report results for four open-weight LLMs, highlighting systematic strengths and weaknesses across abilities, as well as challenges in task-specific evaluation. Beyond quantitative results, CALAMITA exposes methodological lessons: the necessity of fine-grained, task-representative metrics, the importance of harmonized pipelines, and the benefits and limitations of broad community engagement. CALAMITA is conceived as a rolling benchmark, enabling continuous integration of new tasks and models. This makes it both a resource -- the most comprehensive and diverse benchmark for Italian to date -- and a framework for sustainable, community-driven evaluation. We argue that this combination offers a blueprint for other languages and communities seeking inclusive and rigorous LLM evaluation practices.
CLFeb 19
Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in ItalianPietro Ferrazzi, Mattia Franzin, Alberto Lavelli et al.
Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether "small" LLMs (around one billion parameters) can effectively perform medical tasks while maintaining competitive accuracy. We evaluate models from three major families-Llama-3, Gemma-3, and Qwen3-across 20 clinical NLP tasks among Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pretraining). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers strong lower-resource alternatives. Our results show that small LLMs can match or even surpass larger baselines, with our best configuration based on Qwen3-1.7B achieving an average score +9.2 points higher than Qwen3-32B. We release a comprehensive collection of all the publicly available Italian medical datasets for NLP tasks, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian Hospital, and 175M words from various sources that we used for continual pre-training.
CLFeb 26Code
Toward Automatic Filling of Case Report Forms: A Case Study on Data from an Italian Emergency DepartmentGabriela Anna Kaczmarek, Pietro Ferrazzi, Lorenzo Porta et al.
Case Report Forms (CRFs) collect data about patients and are at the core of well-established practices to conduct research in clinical settings. With the recent progress of language technologies, there is an increasing interest in automatic CRF-filling from clinical notes, mostly based on the use of Large Language Models (LLMs). However, there is a general scarcity of annotated CRF data, both for training and testing LLMs, which limits the progress on this task. As a step in the direction of providing such data, we present a new dataset of clinical notes from an Italian Emergency Department annotated with respect to a pre-defined CRF containing 134 items to be filled. We provide an analysis of the data, define the CRF-filling task and metric for its evaluation, and report on pilot experiments where we use an open-source state-of-the-art LLM to automatically execute the task. Results of the case-study show that (i) CRF-filling from real clinical notes in Italian can be approached in a zero-shot setting; (ii) LLMs' results are affected by biases (e.g., a cautious behaviour favours "unknown" answers), which need to be corrected.
CLApr 11, 2024Code
Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical DomainIker García-Ferrero, Rodrigo Agerri, Aitziber Atutxa Salazar et al.
Research on language technology for the development of medical applications is currently a hot topic in Natural Language Understanding and Generation. Thus, a number of large language models (LLMs) have recently been adapted to the medical domain, so that they can be used as a tool for mediating in human-AI interaction. While these LLMs display competitive performance on automated medical texts benchmarks, they have been pre-trained and evaluated with a focus on a single language (English mostly). This is particularly true of text-to-text models, which typically require large amounts of domain-specific pre-training data, often not easily accessible for many languages. In this paper, we address these shortcomings by compiling, to the best of our knowledge, the largest multilingual corpus for the medical domain in four languages, namely English, French, Italian and Spanish. This new corpus has been used to train Medical mT5, the first open-source text-to-text multilingual model for the medical domain. Additionally, we present two new evaluation benchmarks for all four languages with the aim of facilitating multilingual research in this domain. A comprehensive evaluation shows that Medical mT5 outperforms both encoders and similarly sized text-to-text models for the Spanish, French, and Italian benchmarks, while being competitive with current state-of-the-art LLMs in English.
AIMay 19, 2025Code
ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language ModelsMatteo Merler, Nicola Dainese, Minttu Alakuijala et al.
Integrating Large Language Models with symbolic planners is a promising direction for obtaining verifiable and grounded plans compared to planning in natural language, with recent works extending this idea to visual domains using Vision-Language Models (VLMs). However, rigorous comparison between VLM-grounded symbolic approaches and methods that plan directly with a VLM has been hindered by a lack of common environments, evaluation protocols and model coverage. We introduce ViPlan, the first open-source benchmark for Visual Planning with symbolic predicates and VLMs. ViPlan features a series of increasingly challenging tasks in two domains: a visual variant of the classic Blocksworld planning problem and a simulated household robotics environment. We benchmark nine open-source VLM families across multiple sizes, along with selected closed models, evaluating both VLM-grounded symbolic planning and using the models directly to propose actions. We find symbolic planning to outperform direct VLM planning in Blocksworld, where accurate image grounding is crucial, whereas the opposite is true in the household robotics tasks, where commonsense knowledge and the ability to recover from errors are beneficial. Finally, we show that across most models and methods, there is no significant benefit to using Chain-of-Thought prompting, suggesting that current VLMs still struggle with visual reasoning.
CLMar 26, 2025Code
Low-resource Information Extraction with the European Clinical Case CorpusSoumitra Ghosh, Begona Altuna, Saeed Farzi et al.
We present E3C-3.0, a multilingual dataset in the medical domain, comprising clinical cases annotated with diseases and test-result relations. The dataset includes both native texts in five languages (English, French, Italian, Spanish and Basque) and texts translated and projected from the English source into five target languages (Greek, Italian, Polish, Slovak, and Slovenian). A semi-automatic approach has been implemented, including automatic annotation projection based on Large Language Models (LLMs) and human revision. We present several experiments showing that current state-of-the-art LLMs can benefit from being fine-tuned on the E3C-3.0 dataset. We also show that transfer learning in different languages is very effective, mitigating the scarcity of data. Finally, we compare performance both on native data and on projected data. We release the data at https://huggingface.co/collections/NLP-FBK/e3c-projected-676a7d6221608d60e4e9fd89 .
CLJul 16, 2024
Evaluating Task-Oriented Dialogue Consistency through Constraint SatisfactionTiziano Labruna, Bernardo Magnini
Task-oriented dialogues must maintain consistency both within the dialogue itself, ensuring logical coherence across turns, and with the conversational domain, accurately reflecting external knowledge. We propose to conceptualize dialogue consistency as a Constraint Satisfaction Problem (CSP), wherein variables represent segments of the dialogue referencing the conversational domain, and constraints among variables reflect dialogue properties, including linguistic, conversational, and domain-based aspects. To demonstrate the feasibility of the approach, we utilize a CSP solver to detect inconsistencies in dialogues re-lexicalized by an LLM. Our findings indicate that: (i) CSP is effective to detect dialogue inconsistencies; and (ii) consistent dialogue re-lexicalization is challenging for state-of-the-art LLMs, achieving only a 0.15 accuracy rate when compared to a CSP solver. Furthermore, through an ablation study, we reveal that constraints derived from domain knowledge pose the greatest difficulty in being respected. We argue that CSP captures core properties of dialogue consistency that have been poorly considered by approaches based on component pipelines.
CLJun 13, 2025Code
Converting Annotated Clinical Cases into Structured Case Report FormsPietro Ferrazzi, Alberto Lavelli, Bernardo Magnini
Case Report Forms (CRFs) are largely used in medical research as they ensure accuracy, reliability, and validity of results in clinical studies. However, publicly available, wellannotated CRF datasets are scarce, limiting the development of CRF slot filling systems able to fill in a CRF from clinical notes. To mitigate the scarcity of CRF datasets, we propose to take advantage of available datasets annotated for information extraction tasks and to convert them into structured CRFs. We present a semi-automatic conversion methodology, which has been applied to the E3C dataset in two languages (English and Italian), resulting in a new, high-quality dataset for CRF slot filling. Through several experiments on the created dataset, we report that slot filling achieves 59.7% for Italian and 67.3% for English on a closed Large Language Models (zero-shot) and worse performances on three families of open-source models, showing that filling CRFs is challenging even for recent state-of-the-art LLMs. We release the datest at https://huggingface.co/collections/NLP-FBK/e3c-to-crf-67b9844065460cbe42f80166
CLFeb 4, 2025
Evalita-LLM: Benchmarking Large Language Models on ItalianBernardo Magnini, Roberto Zanoli, Michele Resta et al.
We describe Evalita-LLM, a new benchmark designed to evaluate Large Language Models (LLMs) on Italian tasks. The distinguishing and innovative features of Evalita-LLM are the following: (i) all tasks are native Italian, avoiding issues of translating from Italian and potential cultural biases; (ii) in addition to well established multiple-choice tasks, the benchmark includes generative tasks, enabling more natural interaction with LLMs; (iii) all tasks are evaluated against multiple prompts, this way mitigating the model sensitivity to specific prompts and allowing a fairer and objective evaluation. We propose an iterative methodology, where candidate tasks and candidate prompts are validated against a set of LLMs used for development. We report experimental results from the benchmark's development phase, and provide performance statistics for several state-of-the-art LLMs.
CLFeb 24, 2025
All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA BenchmarkDavide Testa, Giovanni Bonetta, Raffaella Bernardi et al.
We introduce MAIA (Multimodal AI Assessment), a native-Italian benchmark designed for fine-grained investigation of the reasoning abilities of visual language models on videos. MAIA differs from other available video benchmarks for its design, its reasoning categories, the metric it uses, and the language and culture of the videos. MAIA evaluates Vision Language Models (VLMs) on two aligned tasks: a visual statement verification task, and an open-ended visual question-answering task, both on the same set of video-related questions. It considers twelve reasoning categories that aim to disentangle language and vision relations by highlighting the role of the visual input. Thanks to its carefully taught design, it evaluates VLMs' consistency and visually grounded natural language comprehension and generation simultaneously through an aggregated metric revealing low results that highlight models' fragility. Last but not least, the video collection has been carefully selected to reflect the Italian culture, and the language data are produced by native-speakers.
CLMay 23, 2023
Unraveling ChatGPT: A Critical Analysis of AI-Generated Goal-Oriented Dialogues and AnnotationsTiziano Labruna, Sofia Brenna, Andrea Zaninello et al.
Large pre-trained language models have exhibited unprecedented capabilities in producing high-quality text via prompting techniques. This fact introduces new possibilities for data collection and annotation, particularly in situations where such data is scarce, complex to gather, expensive, or even sensitive. In this paper, we explore the potential of these models to generate and annotate goal-oriented dialogues, and conduct an in-depth analysis to evaluate their quality. Our experiments employ ChatGPT, and encompass three categories of goal-oriented dialogues (task-oriented, collaborative, and explanatory), two generation modes (interactive and one-shot), and two languages (English and Italian). Based on extensive human-based evaluations, we demonstrate that the quality of generated dialogues and annotations is on par with those generated by humans.
CLNov 1, 2020
Recent Neural Methods on Slot Filling and Intent Classification for Task-Oriented Dialogue Systems: A SurveySamuel Louvan, Bernardo Magnini
In recent years, fostered by deep learning technologies and by the high demand for conversational AI, various approaches have been proposed that address the capacity to elicit and understand user's needs in task-oriented dialogue systems. We focus on two core tasks, slot filling (SF) and intent classification (IC), and survey how neural-based models have rapidly evolved to address natural language understanding in dialogue systems. We introduce three neural architectures: independent model, which model SF and IC separately, joint models, which exploit the mutual benefit of the two tasks simultaneously, and transfer learning models, that scale the model to new domains. We discuss the current state of the research in SF and IC and highlight challenges that still require attention.
CLSep 8, 2020
Simple is Better! Lightweight Data Augmentation for Low Resource Slot Filling and Intent ClassificationSamuel Louvan, Bernardo Magnini
Neural-based models have achieved outstanding performance on slot filling and intent classification, when fairly large in-domain training data are available. However, as new domains are frequently added, creating sizeable data is expensive. We show that lightweight augmentation, a set of augmentation methods involving word span and sentence level operations, alleviates data scarcity problems. Our experiments on limited data settings show that lightweight augmentation yields significant performance improvement on slot filling on the ATIS and SNIPS datasets, and achieves competitive performance with respect to more complex, state-of-the-art, augmentation approaches. Furthermore, lightweight augmentation is also beneficial when combined with pre-trained LM-based models, as it improves BERT-based joint intent and slot filling models.
CLMar 30, 2020
The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual EuropeGeorg Rehm, Katrin Marheinecke, Stefanie Hegele et al.
Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe's specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI, including many opportunities, synergies but also misconceptions, has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions.
CLJan 21, 2020
Domain-Aware Dialogue State Tracker for Multi-Domain Dialogue SystemsVevake Balaraman, Bernardo Magnini
In task-oriented dialogue systems the dialogue state tracker (DST) component is responsible for predicting the state of the dialogue based on the dialogue history. Current DST approaches rely on a predefined domain ontology, a fact that limits their effective usage for large scale conversational agents, where the DST constantly needs to be interfaced with ever-increasing services and APIs. Focused towards overcoming this drawback, we propose a domain-aware dialogue state tracker, that is completely data-driven and it is modeled to predict for dynamic service schemas. The proposed model utilizes domain and slot information to extract both domain and slot specific representations for a given dialogue, and then uses such representations to predict the values of the corresponding slot. Integrating this mechanism with a pretrained language model (i.e. BERT), our approach can effectively learn semantic relations.
CLNov 1, 2019
A Robust Data-Driven Approach for Dialogue State Tracking of Unseen Slot ValuesVevake Balaraman, Bernardo Magnini
A Dialogue State Tracker is a key component in dialogue systems which estimates the beliefs of possible user goals at each dialogue turn. Deep learning approaches using recurrent neural networks have shown state-of-the-art performance for the task of dialogue state tracking. Generally, these approaches assume a predefined candidate list and struggle to predict any new dialogue state values that are not seen during training. This makes extending the candidate list for a slot without model retaining infeasible and also has limitations in modelling for low resource domains where training data for slot values are expensive. In this paper, we propose a novel dialogue state tracker based on copying mechanism that can effectively track such unseen slot values without compromising performance on slot values seen during training. The proposed model is also flexible in extending the candidate list without requiring any retraining or change in the model. We evaluate the proposed model on various benchmark datasets (DSTC2, DSTC3 and WoZ2.0) and show that our approach, outperform other end-to-end data-driven approaches in tracking unseen slot values and also provides significant advantages in modelling for DST.
CLOct 22, 2019
Scalable Neural Dialogue State TrackingVevake Balaraman, Bernardo Magnini
A Dialogue State Tracker (DST) is a key component in a dialogue system aiming at estimating the beliefs of possible user goals at each dialogue turn. Most of the current DST trackers make use of recurrent neural networks and are based on complex architectures that manage several aspects of a dialogue, including the user utterance, the system actions, and the slot-value pairs defined in a domain ontology. However, the complexity of such neural architectures incurs into a considerable latency in the dialogue state prediction, which limits the deployments of the models in real-world applications, particularly when task scalability (i.e. amount of slots) is a crucial factor. In this paper, we propose an innovative neural model for dialogue state tracking, named Global encoder and Slot-Attentive decoders (G-SAT), which can predict the dialogue state with a very low latency time, while maintaining high-level performance. We report experiments on three different languages (English, Italian, and German) of the WoZ2.0 dataset, and show that the proposed approach provides competitive advantages over state-of-art DST systems, both in terms of accuracy and in terms of time complexity for predictions, being over 15 times faster than the other systems.
CLOct 3, 2016
Distributed Representations of Lexical Sets and Prototypes in Causal Alternation VerbsEdoardo Maria Ponti, Elisabetta Jezek, Bernardo Magnini
Lexical sets contain the words filling an argument slot of a verb, and are in part determined by selectional preferences. The purpose of this paper is to unravel the properties of lexical sets through distributional semantics. We investigate 1) whether lexical set behave as prototypical categories with a centre and a periphery; 2) whether they are polymorphic, i.e. composed by subcategories; 3) whether the distance between lexical sets of different arguments is explanatory of verb properties. In particular, our case study are lexical sets of causative-inchoative verbs in Italian. Having studied several vector models, we find that 1) based on spatial distance from the centroid, object fillers are scattered uniformly across the category, whereas intransitive subject fillers lie on its edge; 2) a correlation exists between the amount of verb senses and that of clusters discovered automatically, especially for intransitive subjects; 3) the distance between the centroids of object and intransitive subject is correlated with other properties of verbs, such as their cross-lingual tendency to appear in the intransitive pattern rather than transitive one. This paper is noncommittal with respect to the hypothesis that this connection is underpinned by a semantic reason, namely the spontaneity of the event denoted by the verb.