CL AI · Jul 10, 2024

A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models: Safety, Consensus, Objectivity, Reproducibility and Explainability

Ting Fang Tan, Kabilan Elangovan, Jasmine Ong, Nigam Shah, Joseph Sung, Tien Yin Wong, Lan Xue, Nan Liu, Haibo Wang, Chang Fu Kuo, Simon Chesterman, Zee Kin Yeong

arXiv:2407.07666v13.413 citations · h-index: 18

Originality: Synthesis-oriented

AI Analysis

The paper addresses the need for comprehensive evaluation frameworks in healthcare AI to improve trust and safety, though its contribution is incremental because it builds on existing evaluation concepts.

The paper tackles the problem of evaluating large language models in healthcare by proposing the S.C.O.R.E. framework, which expands beyond traditional metrics to include safety, consensus, objectivity, reproducibility, and explainability, aiming to ensure models are safe, reliable, and ethical for clinical use.

A comprehensive qualitative evaluation framework for large language models (LLMs) in healthcare that expands beyond traditional accuracy and quantitative metrics is needed. We propose 5 key aspects for the evaluation of LLMs: Safety, Consensus, Objectivity, Reproducibility and Explainability (S.C.O.R.E.). We suggest that S.C.O.R.E. may form the basis for an evaluation framework for future LLM-based models that are safe, reliable, trustworthy, and ethical for healthcare and clinical applications.
