Evaluating Large Language Models with fmeval
The work gives practitioners a tool for assessing LLMs, but its contribution is incremental: it focuses on library development rather than on new evaluation methods.
The paper introduces fmeval, an open-source library for evaluating large language models across tasks and responsible AI dimensions, demonstrating its use in selecting a model for question answering.
fmeval is an open-source library for evaluating large language models (LLMs) on a range of tasks. It helps practitioners evaluate their models for task performance and along multiple responsible AI dimensions. This paper presents the library and lays out its underlying design principles: simplicity, coverage, extensibility, and performance. We then describe how these principles were realized in the scientific and engineering choices made when developing fmeval. A case study demonstrates a typical use case for the library: picking a suitable model for a question answering task. We close by discussing limitations and further work in the development of the library. fmeval can be found at https://github.com/aws/fmeval.
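To make the question answering use case concrete, the following is a minimal sketch of how candidate models might be compared on exact match and token-level F1, two metrics commonly used for QA. It deliberately does not call the fmeval API: the function names, the model names, and the toy data are all hypothetical and serve only to illustrate the kind of comparison the case study describes.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, and split into whitespace tokens."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return cleaned.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def score_model(predictions: list[str], references: list[str]) -> dict[str, float]:
    """Average both metrics over a dataset of (prediction, reference) pairs."""
    n = len(references)
    return {
        "exact_match": sum(exact_match(p, r) for p, r in zip(predictions, references)) / n,
        "f1": sum(token_f1(p, r) for p, r in zip(predictions, references)) / n,
    }


def pick_best(outputs: dict[str, list[str]], references: list[str], metric: str = "f1"):
    """Score every candidate and return the name of the top one plus all scores."""
    scores = {name: score_model(preds, references) for name, preds in outputs.items()}
    best = max(scores, key=lambda name: scores[name][metric])
    return best, scores


# Hypothetical toy data: two candidate models answering two questions.
references = ["Paris", "William Shakespeare"]
outputs = {
    "model_a": ["Paris", "Shakespeare"],
    "model_b": ["London", "William Shakespeare wrote it"],
}

best, scores = pick_best(outputs, references)
print(best, scores)
```

In practice the selection would also weigh the responsible AI dimensions mentioned above, not task accuracy alone; this sketch covers only the task-performance side.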