CL LG | Jul 15, 2024

Evaluating Large Language Models with fmeval

Pola Schwöbel, Luca Franceschi, Muhammad Bilal Zafar, Keerthan Vasist, Aman Malhotra, Tomer Shenhar, Pinal Tailor, Pinar Yilmaz, Michael Diamond, Michele Donini

arXiv:2407.12872v14.24 citations | h-index: 12 | Has Code

Originality: Synthesis-oriented

AI Analysis

The paper gives practitioners a tool for assessing LLMs, but its contribution is incremental: it focuses on library development rather than new evaluation methods.

The paper introduces fmeval, an open-source library for evaluating large language models across tasks and responsible AI dimensions, demonstrating its use in selecting a model for question answering.

fmeval is an open-source library to evaluate large language models (LLMs) in a range of tasks. It helps practitioners evaluate their model for task performance and along multiple responsible AI dimensions. This paper presents the library and exposes its underlying design principles: simplicity, coverage, extensibility and performance. We then present how these were implemented in the scientific and engineering choices made when developing fmeval. A case study demonstrates a typical use case for the library: picking a suitable model for a question answering task. We close by discussing limitations and further work in the development of the library. fmeval can be found at https://github.com/aws/fmeval.
