CL · Feb 4, 2025

Evalita-LLM: Benchmarking Large Language Models on Italian

Bernardo Magnini, Roberto Zanoli, Michele Resta, Martin Cimmino, Paolo Albano, Marco Madeddu, Viviana Patti

arXiv:2502.02289v113.08 citations · h-index: 13

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for better evaluation of LLMs in Italian, though its contribution is incremental: it adapts existing benchmarking approaches to a specific language.

The authors address the evaluation of large language models (LLMs) on Italian tasks by introducing Evalita-LLM, a benchmark with native Italian tasks, generative components, and multiple prompts per task, and they report performance statistics for several state-of-the-art LLMs.

We describe Evalita-LLM, a new benchmark designed to evaluate Large Language Models (LLMs) on Italian tasks. Evalita-LLM has three distinguishing features: (i) all tasks are native Italian, avoiding translation issues and potential cultural biases; (ii) in addition to well-established multiple-choice tasks, the benchmark includes generative tasks, enabling more natural interaction with LLMs; (iii) all tasks are evaluated against multiple prompts, which mitigates model sensitivity to specific prompts and allows a fairer, more objective evaluation. We propose an iterative methodology in which candidate tasks and candidate prompts are validated against a set of LLMs used for development. We report experimental results from the benchmark's development phase and provide performance statistics for several state-of-the-art LLMs.
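
To illustrate feature (iii), the following is a minimal sketch of how one model's scores on one task could be summarized across several prompts. It is not the benchmark's actual scoring code: the function name `summarize_prompt_scores`, the prompt identifiers, the accuracy values, and the choice of mean, standard deviation, and best prompt as summary statistics are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def summarize_prompt_scores(scores_by_prompt):
    """Summarize one model's scores on one task across several prompts.

    scores_by_prompt maps a prompt identifier to a score in [0, 1].
    The statistics returned here are illustrative assumptions, not
    the metrics defined by Evalita-LLM.
    """
    if not scores_by_prompt:
        raise ValueError("need at least one prompt score")
    values = list(scores_by_prompt.values())
    best_prompt = max(scores_by_prompt, key=scores_by_prompt.get)
    return {
        "mean": mean(values),          # average over prompts
        "std": pstdev(values),         # spread, i.e. prompt sensitivity
        "best_prompt": best_prompt,    # prompt with the highest score
        "best": scores_by_prompt[best_prompt],
    }


# Hypothetical accuracies for one model under three prompt templates.
example = {"prompt_1": 0.5, "prompt_2": 0.75, "prompt_3": 0.625}
summary = summarize_prompt_scores(example)
print(summary)
```

A large spread across prompts would suggest that a single-prompt score says as much about the prompt as about the model, which is the sensitivity that multi-prompt evaluation is meant to reduce.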
