Enterprise Large Language Model Evaluation Benchmark
This work addresses the need for tailored LLM evaluations in enterprise settings and offers actionable insights for model optimization, though the contribution is incremental because it builds on existing benchmarking methods.
The authors address the inadequacy of existing evaluation benchmarks for large language models (LLMs) in enterprise contexts by proposing a 14-task framework based on Bloom's Taxonomy. The result is a curated 9,700-sample benchmark that reveals performance gaps: open-source models such as DeepSeek R1 rival proprietary ones in reasoning but lag in judgment tasks.
Large Language Models (LLMs) have demonstrated promise in boosting productivity across AI-powered tools, yet existing benchmarks such as Massive Multitask Language Understanding (MMLU) inadequately assess enterprise-specific task complexities. We propose a 14-task framework grounded in Bloom's Taxonomy to holistically evaluate LLM capabilities in enterprise contexts. To address the challenges of noisy data and costly annotation, we develop a scalable pipeline combining LLM-as-a-Labeler, LLM-as-a-Judge, and corrective retrieval-augmented generation (CRAG), curating a robust 9,700-sample benchmark. Evaluation of six leading models shows that open-source contenders like DeepSeek R1 rival proprietary models in reasoning tasks but lag in judgment-based scenarios, likely due to overthinking. Our benchmark reveals critical enterprise performance gaps and offers actionable insights for model optimization. This work provides enterprises with a blueprint for tailored evaluations and advances practical LLM deployment.
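The abstract names the three components of the curation pipeline but does not say how they interact. The sketch below shows one plausible arrangement and is not the authors' implementation: a labeler proposes a label, a judge scores it, and a low score triggers a single corrective retrieval step before relabeling. The names `curate`, `labeler`, `judge`, and `retrieve`, the 0.7 threshold, and the keyword-based stubs are all hypothetical; in practice the stubs would be replaced by real model and retrieval calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    text: str
    label: str
    score: float


def curate(
    texts: List[str],
    labeler: Callable[[str, str], str],   # (text, context) -> label
    judge: Callable[[str, str], float],   # (text, label) -> score in [0, 1]
    retrieve: Callable[[str], str],       # text -> supporting context
    threshold: float = 0.7,               # assumed acceptance cutoff
) -> List[Sample]:
    """Keep only samples whose label the judge accepts."""
    kept: List[Sample] = []
    for text in texts:
        label = labeler(text, "")
        score = judge(text, label)
        if score < threshold:
            # Corrective step: fetch context and relabel once.
            context = retrieve(text)
            label = labeler(text, context)
            score = judge(text, label)
        if score >= threshold:
            kept.append(Sample(text, label, score))
    return kept


# Hypothetical stubs standing in for LLM and retrieval calls.
def stub_labeler(text: str, context: str) -> str:
    return "billing" if "invoice" in (text + " " + context).lower() else "other"


def stub_judge(text: str, label: str) -> float:
    return 0.9 if label != "other" else 0.2


KNOWLEDGE: Dict[str, str] = {"payment reminder": "Relates to an unpaid invoice."}


def stub_retrieve(text: str) -> str:
    return KNOWLEDGE.get(text, "")


curated = curate(
    ["invoice overdue", "payment reminder", "hello"],
    stub_labeler,
    stub_judge,
    stub_retrieve,
)
for sample in curated:
    print(sample)
```

With these stubs, the second input is rejected on the first pass, relabeled after retrieval, and then accepted, while the third is dropped because retrieval adds nothing the judge can accept. Limiting the corrective step to one retry keeps the cost per sample bounded; whether the actual pipeline does this is not stated in the abstract.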