Fred Cohen

20.8 AI Jun 3

Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison

Alejandro Lozano, Keiko Ihara, Ping-Hao Yang et al.

Summarizing the latest medical literature to guide clinical decision-making is essential for evidence-based medicine and high-quality patient care. Yet clinicians face increasing challenges due to limited time with patients and a rapidly growing volume of published articles. Although retrieval-augmented large language models (LLMs) have shown promise in clinical summarization, human evaluations of their effectiveness in synthesizing broader scientific literature and direct comparisons to expert-written syntheses remain scarce. We constructed a RAG-based agentic AI framework using three state-of-the-art LLMs: Sonnet, GPT-4o, and Llama 3.1. A headache specialist created 13 questions, three for prompt optimization and ten for evaluation. Ten headache specialists across the United States and Canada each wrote a summary for one question, yielding four summaries per question (expert, Sonnet, GPT-4o, and Llama). The experts, blinded to authorship, critically evaluated the summaries, excluding the topic for which they wrote a summary, based on correctness, completeness, conciseness, and clinical utility, scoring each from 1 to 10 using standardized rubrics. They also ranked the summaries by preference and indicated whether they believed each summary was written by an expert or an LLM. Our study, comparing LLM- and expert-written literature summaries evaluated by headache specialists, showed that expert-written summaries were preferred, although experts sometimes found it challenging to distinguish between human- and AI-generated summaries. We also identified key expert-valued features beyond standard evaluation metrics that can guide future refinement of both human and AI literature summarization pipelines.

SY Oct 12, 2014

Bounds on Multiple Sensor Fusion

Bill Moran, Fred Cohen, Zengfu Wang et al.

We consider the problem of fusing measurements from multiple sensors, where the sensing regions overlap and data are non-negative, possibly resulting from a count of indistinguishable discrete entities. Because of overlaps, it is, in general, impossible to fuse this information to arrive at an accurate estimate of the overall amount or count of material present in the union of the sensing regions. Here we study the range of overall values consistent with the data. Posed as a linear programming problem, this leads to interesting questions associated with the geometry of the sensor regions, specifically, the arrangement of their non-empty intersections. We define a computational tool called the fusion polytope and derive a condition for this to be in the positive orthant, thus simplifying calculations. We show that, in two dimensions, inflated tiling schemes based on rectangular regions fail to satisfy this condition, whereas inflated tiling schemes based on hexagons do.
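
As a minimal illustration of the linear-programming view described in the abstract, consider a two-sensor toy case. It is not taken from the paper, and every symbol in it is an assumption introduced here. Sensors A and B overlap; x_1 is the unknown amount seen only by A, x_2 the amount in the overlap, x_3 the amount seen only by B, and a, b are the two non-negative readings. The total T in the union is then only bounded, not determined:

```latex
\begin{align*}
  x_1 + x_2 &= a, \qquad x_2 + x_3 = b, \qquad x_1, x_2, x_3 \ge 0, \\
  T &= x_1 + x_2 + x_3 = a + b - x_2, \qquad 0 \le x_2 \le \min(a, b), \\
  \max(a, b) &\le T \le a + b .
\end{align*}
```

The lower bound corresponds to the largest feasible overlap and the upper bound to an empty overlap. With more sensors, the same minimization and maximization of T runs over one non-negative variable per non-empty intersection region.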
