Objective Metrics for Evaluating Large Language Models Using External Data Sources
The paper addresses the need for objective evaluation in educational, scientific, and other domains, but its contribution appears incremental because it builds on existing benchmarks and automation methods.
The paper tackles the problem of subjective evaluation of Large Language Models by proposing a framework that uses external data sources and benchmarks to provide consistent, reproducible, and bias-minimized measurements, resulting in a scalable solution for performance assessment in high-stakes domains.
Evaluating the performance of Large Language Models (LLMs) is a critical yet challenging task, particularly when the aim is to avoid subjective assessments. This paper proposes a framework that uses objective metrics, derived from class textual materials collected across different semesters, to assess LLM outputs on a variety of tasks. By combining well-defined benchmarks, factual datasets, and structured evaluation pipelines, the approach aims for consistent, reproducible, and bias-minimized measurements. The framework emphasizes automation and transparency in scoring, reducing reliance on human interpretation while keeping the evaluation aligned with real-world applications. It thereby addresses the limitations of subjective evaluation and provides a scalable solution for performance assessment in educational, scientific, and other high-stakes domains.
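As an illustration of what automated, reference-based scoring can look like, the following is a minimal sketch, not the paper's implementation. It assumes the external data source has already been reduced to a mapping from question IDs to reference answers, and it uses exact match and token-level F1 as stand-in metrics; the abstract does not specify which metrics the framework uses. All names (`normalize`, `exact_match`, `token_f1`, `evaluate`) are hypothetical.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(outputs: dict[str, str], references: dict[str, str]) -> dict[str, float]:
    """Average both metrics over every reference item (assumed format).

    A missing model output is scored as an empty answer, so skipped
    questions lower the score instead of being silently dropped.
    """
    if not references:
        raise ValueError("references must not be empty")
    ids = sorted(references)  # fixed order keeps runs reproducible
    em = [exact_match(outputs.get(i, ""), references[i]) for i in ids]
    f1 = [token_f1(outputs.get(i, ""), references[i]) for i in ids]
    return {"exact_match": sum(em) / len(ids), "token_f1": sum(f1) / len(ids)}


# Hypothetical toy data, for illustration only.
references = {"q1": "Paris", "q2": "the mitochondria"}
outputs = {"q1": "paris.", "q2": "mitochondria"}
result = evaluate(outputs, references)
```

Because the scoring is a pure function of the model outputs and the reference data, repeated runs on the same inputs give the same numbers, which is the reproducibility property the abstract emphasizes.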