Evaluating LLMs with Multiple Problems at once
This paper addresses the need for more efficient and realistic evaluation methods for LLMs, though its contribution is incremental because it builds on existing benchmarks.
The paper introduces multi-problem evaluation (MPE), a paradigm in which an LLM is tested by presenting multiple problems in a single prompt, and creates ZeMPE, a benchmark of 53,100 zero-shot multi-problem prompts. Results show that LLMs handle multiple problems from a single data source about as well as they handle the same problems presented separately, though this capability falls short under certain conditions.
This paper shows the benefits and fruitfulness of evaluating LLMs with multiple problems at once, a paradigm we call multi-problem evaluation (MPE). Unlike conventional single-problem evaluation, where a prompt presents a single problem and expects one specific answer, MPE places multiple problems together in a single prompt and assesses how well an LLM answers all of them in a single output. Leveraging 6 existing classification benchmarks and 12 existing reasoning benchmarks, we introduce a new benchmark called ZeMPE (Zero-shot Multi-Problem Evaluation), comprising 53,100 zero-shot multi-problem prompts. We experiment with a total of 13 LLMs from 5 model families on ZeMPE to present a comprehensive and systematic MPE. Our results show that LLMs are capable of handling multiple problems from a single data source as well as handling them separately, but there are conditions under which this multi-problem handling capability falls short. In addition, we perform further in-depth analyses and explore model-level factors that may enable multi-problem handling capabilities in LLMs. We release our corpus and code to facilitate future research.
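To make the contrast with single-problem evaluation concrete, the following is a minimal sketch of how a multi-problem prompt might be assembled and scored. The prompt wording, the `<number>: <answer>` output format, and the helper names `build_multi_problem_prompt`, `parse_answers`, and `per_problem_accuracy` are all assumptions made for illustration; they are not the ZeMPE format or the authors' released code.

```python
import re
from typing import List, Optional


def build_multi_problem_prompt(instruction: str, problems: List[str]) -> str:
    """Place several problems in one prompt, numbering each (hypothetical format)."""
    numbered = [f"{i}. {p}" for i, p in enumerate(problems, start=1)]
    return (
        instruction
        + "\n\n"
        + "\n".join(numbered)
        + "\n\nAnswer each problem on its own line as '<number>: <answer>'."
    )


def parse_answers(output: str, n_problems: int) -> List[Optional[str]]:
    """Extract one answer per problem from a single model output.

    Problems the model skipped stay as None, so they count as wrong.
    """
    answers: List[Optional[str]] = [None] * n_problems
    pattern = r"^[ \t]*(\d+)[ \t]*[:.][ \t]*(.+?)[ \t]*$"
    for match in re.finditer(pattern, output, flags=re.MULTILINE):
        idx = int(match.group(1))
        # Keep only the first answer given for each in-range problem number.
        if 1 <= idx <= n_problems and answers[idx - 1] is None:
            answers[idx - 1] = match.group(2)
    return answers


def per_problem_accuracy(predicted: List[Optional[str]], gold: List[str]) -> float:
    """Fraction of problems answered correctly (case-insensitive exact match)."""
    correct = sum(
        p is not None and p.strip().lower() == g.strip().lower()
        for p, g in zip(predicted, gold)
    )
    return correct / len(gold)


if __name__ == "__main__":
    prompt = build_multi_problem_prompt(
        "Classify the sentiment of each review as positive or negative.",
        ["Great movie.", "Terrible plot.", "Loved it."],
    )
    print(prompt)
    # A made-up model output, standing in for a real LLM call.
    fake_output = "1: positive\n2: negative\n3: positive"
    predictions = parse_answers(fake_output, 3)
    print(per_problem_accuracy(predictions, ["positive", "negative", "positive"]))
```

Under this sketch, single-problem evaluation corresponds to calling the same functions with one problem per prompt, so the two settings can be compared on identical items using the same accuracy measure.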