CL, LG | Feb 16, 2024

Exploring Precision and Recall to assess the quality and diversity of LLMs

Florian Le Bronnec, Alexandre Verine, Benjamin Negrevergne, Yann Chevaleyre, Alexandre Allauzen

arXiv:2402.10693v3 | 19.138 citations | h-index: 31 | Has Code | ACL

Originality: Incremental advance

AI Analysis

This work provides a new evaluation framework for NLP researchers and practitioners to assess LLMs in open-ended tasks, though it is incremental as it adapts existing metrics to a new domain.

The authors tackled the problem of evaluating quality and diversity in large language model text generation by adapting Precision and Recall metrics from image generation, revealing a trade-off between these aspects in models like Llama-2 and Mistral, particularly after fine-tuning.

We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral, focusing on importing Precision and Recall metrics from image generation to text generation. This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora. By conducting a comprehensive evaluation of state-of-the-art language models, the study reveals new insights into their performance on open-ended generation tasks, which are not adequately captured by traditional benchmarks. The findings highlight a trade-off between the quality and diversity of generated samples, particularly when models are fine-tuned on instruction datasets or with human feedback. This work extends the toolkit for distribution-based NLP evaluation, offering insights into the practical capabilities and challenges that current LLMs face in generating diverse and high-quality text. We release our code and data.
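To make the quality/diversity distinction concrete, here is a minimal sketch of one common way Precision and Recall are estimated for generative models in the image-generation literature: a k-nearest-neighbour support estimate over sample embeddings. This is an assumption about the general technique, not the authors' released code. The function names are hypothetical, and the toy 2-D points stand in for text embeddings that would come from a language model in practice.

```python
import math

def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour within the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii

def coverage(queries, support, k):
    """Fraction of queries that fall inside at least one k-NN ball of the support set."""
    radii = knn_radii(support, k)
    hits = sum(
        1 for x in queries
        if any(math.dist(x, s) <= r for s, r in zip(support, radii))
    )
    return hits / len(queries)

def precision_recall(reference, generated, k=1):
    # Precision (quality): generated samples that land on the reference support.
    precision = coverage(generated, reference, k)
    # Recall (diversity): reference samples that land on the generated support.
    recall = coverage(reference, generated, k)
    return precision, recall

# Toy data (hypothetical): the reference has two modes, the generator covers only one.
reference = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
generated = [(0.0, 0.5), (0.2, 0.4), (0.1, 0.6)]

precision, recall = precision_recall(reference, generated, k=1)
print(precision, recall)
```

In this toy case every generated point lies close to a reference point, so precision is high, while the tightly clustered generated set covers none of the reference points, so recall is low. That is the kind of quality-versus-diversity trade-off the abstract describes.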
