A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models: Safety, Consensus, Objectivity, Reproducibility and Explainability
This work addresses the need for comprehensive evaluation frameworks in healthcare AI to improve trust and safety, though the contribution is incremental because it builds on existing evaluation concepts.
The paper tackles the problem of evaluating large language models in healthcare by proposing the S.C.O.R.E. framework, which expands beyond traditional metrics to cover safety, consensus, objectivity, reproducibility, and explainability. The aim is to ensure that models are safe, reliable, and ethical for clinical use.
A comprehensive qualitative evaluation framework for large language models (LLMs) in healthcare that expands beyond traditional accuracy and quantitative metrics is needed. We propose five key aspects for the evaluation of LLMs: Safety, Consensus, Objectivity, Reproducibility and Explainability (S.C.O.R.E.). We suggest that S.C.O.R.E. may form the basis for an evaluation framework for future LLM-based models that are safe, reliable, trustworthy, and ethical for healthcare and clinical applications.