CL | LG | Jul 15, 2024

Evaluating Large Language Models with fmeval

arXiv:2407.12872v1 | 4 citations | h-index: 12 | Has Code
Originality: Synthesis-oriented
AI Analysis

The paper gives practitioners a tool for assessing LLMs, but its contribution is incremental: it focuses on library development rather than on new evaluation methods.

The paper introduces fmeval, an open-source library for evaluating large language models across tasks and responsible AI dimensions, demonstrating its use in selecting a model for question answering.

fmeval is an open source library to evaluate large language models (LLMs) in a range of tasks. It helps practitioners evaluate their model for task performance and along multiple responsible AI dimensions. This paper presents the library and exposes its underlying design principles: simplicity, coverage, extensibility and performance. We then present how these were implemented in the scientific and engineering choices taken when developing fmeval. A case study demonstrates a typical use case for the library: picking a suitable model for a question answering task. We close by discussing limitations and further work in the development of the library. fmeval can be found at https://github.com/aws/fmeval.
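The model-selection use case mentioned in the abstract can be illustrated with a minimal, self-contained sketch. This is not fmeval's API: the function names (`normalize`, `exact_match`, `token_f1`, `score_model`, `pick_best`), the candidate model names, and the toy data are all hypothetical, and the two metrics are common question-answering scores chosen here as an assumption about what such a comparison might compute.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction, target):
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(target))


def token_f1(prediction, target):
    """Harmonic mean of token-level precision and recall."""
    pred, gold = normalize(prediction), normalize(target)
    if not pred or not gold:
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def score_model(answers, targets):
    """Average each metric over all (answer, target) pairs."""
    n = len(targets)
    pairs = list(zip(answers, targets))
    return {
        "exact_match": sum(exact_match(a, t) for a, t in pairs) / n,
        "f1": sum(token_f1(a, t) for a, t in pairs) / n,
    }


def pick_best(candidates, targets, metric="f1"):
    """Score every candidate and return the name with the highest metric."""
    scores = {name: score_model(ans, targets) for name, ans in candidates.items()}
    best = max(scores, key=lambda name: scores[name][metric])
    return best, scores


# Hypothetical reference answers and pre-collected model outputs.
targets = ["Paris", "4"]
candidates = {
    "model_a": ["The capital is Paris", "4"],
    "model_b": ["Paris", "five"],
}

best, scores = pick_best(candidates, targets)
print(best, scores)
```

On this toy data both candidates tie on exact match, so the choice depends on which metric is used for ranking. A real comparison would use a held-out dataset and, as the abstract notes, would also weigh responsible AI dimensions alongside task accuracy.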

Code Implementations: 1 repo
