91.8CYApr 3
A Scoping Review of LLM-as-a-Judge in Healthcare and the MedJUDGE FrameworkChenyu Li, Zohaib Akhtar, Mingu Kwak et al.
As large language models (LLMs) increasingly generate and process clinical text, scalable evaluation has become critical. LLM-as-a-Judge (LaaJ), which uses LLMs to evaluate model outputs, offers a scalable alternative to costly expert review, but its healthcare adoption raises safety and bias concerns. We conducted a PRISMA-ScR scoping review of six databases (January 2020-January 2026), screening 11,727 studies and including 49. The landscape was dominated by evaluation and benchmarking applications (n=37, 75.5%), pointwise scoring (n=42, 85.7%), and GPT-family judges (n=36, 73.5%). Despite growing adoption, validation rigor was limited: among 36 studies with human involvement, the median number of expert validators was 3, while 13 (26.5%) used none. Risk of bias testing was absent in 36 studies (73.5%), only 1 (2.0%) examined demographic fairness, and none assessed temporal stability or patient context. Deployment remained limited, with 1 study (2.0%) reaching production and four (8.2%) prototype stage. Importantly, these gaps may interact: when judges and evaluated systems share training data or architectures, they may inherit similar blind spots, and agreement metrics may fail to distinguish true validity from shared errors. Minimal human oversight, limited bias assessment, and model monoculture together represent a governance gap where current validation may miss clinically significant errors. To address this, we propose MedJUDGE (Medical Judge Utility, De-biasing, Governance and Evaluation), a risk-stratified three-pillar framework organized around validity, safety, and accountability across clinical risk tiers, providing deployment-oriented evaluation guidance for healthcare LaaJ systems.
AIMar 27, 2013
A Decision-Theoretic Model for Using Scientific DataHarold P. Lehmann
Many Artificial Intelligence systems depend on the agent's updating its beliefs about the world on the basis of experience. Experiments constitute one type of experience, so scientific methodology offers a natural environment for examining the issues attendant to using this class of evidence. This paper presents a framework which structures the process of using scientific data from research reports for the purpose of making decisions, using decision analysis as the basis for the structure and using medical research as the general scientific domain. The structure extends the basic influence diagram for updating belief in an object domain parameter of interest by expanding the parameter into four parts: those of the patient, the population, the study sample, and the effective study sample. The structure uses biases to perform the transformation of one parameter into another, so that, for instance, selection biases, in concert with the population parameter, yield the study sample parameter. The influence diagram structure provides decision theoretic justification for practices of good clinical research such as randomized assignment and blindfolding of care providers. The model covers most research designs used in medicine: case-control studies, cohort studies, and controlled clinical trials, and provides an architecture to separate clearly between statistical knowledge and domain knowledge. The proposed general model can be the basis for clinical epidemiological advisory systems, when coupled with heuristic pruning of irrelevant biases; of statistical workstations, when the computational machinery for calculation of posterior distributions is added; and of meta-analytic reviews, when multiple studies may impact on a single population parameter.
AIMar 6, 2013
End-User Construction of Influence Diagrams for Bayesian StatisticsHarold P. Lehmann, Ross D. Shachter
Influence diagrams are ideal knowledge representations for Bayesian statistical models. However, these diagrams are difficult for end users to interpret and to manipulate. We present a user-based architecture that enables end users to create and to manipulate the knowledge representation. We use the problem of physicians' interpretation of two-arm parallel randomized clinical trials (TAPRCT) to illustrate the architecture and its use. There are three primary data structures. Elements of statistical models are encoded as subgraphs of a restricted class of influence diagram. The interpretations of those elements are mapped into users' language in a domain-specific, user-based semantic interface, called a patient-flow diagram, in the TAPRCT problem. Pennitted transformations of the statistical model that maintain the semantic relationships of the model are encoded in a metadata-state diagram, called the cohort-state diagram, in the TAPRCT problem. The algorithm that runs the system uses modular actions called construction steps. This framework has been implemented in a system called THOMAS, that allows physicians to interpret the data reported from a TAPRCT.