HC May 17
Evaluating Physician-AI Interaction for Cancer Management: Paving the Path towards Precision Oncology
Zeshan Hussain, Barbara D. Lam, Fernando A. Acosta-Perez et al.
As machine learning (ML)-based decision support tools proliferate in clinical practice, understanding how clinicians integrate personalized ML predictions alongside randomized controlled trial (RCT) evidence is critical. We designed a web-based clinical decision support system (CDSS) presenting survival and adverse event data from a simulated RCT and ML model across 12 synthetic multiple myeloma scenarios. In a within-subjects study with 32 physicians, we evaluated how clinicians synthesize competing evidence sources to make treatment decisions. When ML and RCT outputs were concordant, physicians reported greater confidence than with RCT data alone. When results were discordant, most physicians shifted toward the ML-supported treatment, often before reviewing any information about model training or validation, suggesting a tendency toward automation bias rather than algorithm avoidance. Despite reporting higher perceived reliability after viewing model quality disclosures, physicians were largely unable to describe the validation procedures they had reviewed. Taken together, these findings reveal that clinicians may over-rely on ML recommendations even when equipped with tools designed to support critical appraisal. We discuss implications for CDSS design, clinician training, and the institutional safeguards needed before ML-based systems are deployed in high-stakes oncology settings.
CL Apr 28
Diagnosis, Bad Planning & Reasoning. Treatment, SCOPE -- Planning for Hybrid Querying over Clinical Trial Data
Suparno Roy Chowdhury, Manan Roy Choudhury, Tejas Anvekar et al.
We study clinical trial table reasoning, where answers are not directly stored in visible cells but must be inferred through normalization, classification, extraction, or lightweight domain reasoning. Motivated by the observation that current LLM approaches often suffer from "bad reasoning" under implicit planning assumptions, we focus on settings in which the model must recover implicit attributes such as therapy type, added agents, endpoint roles, or follow-up status from partially observed clinical-trial tables. We propose SCOPE (Structured Clinical hybrid Planning for Evidence retrieval in clinical trials), a multi-LLM planner-based framework that decomposes the task into row selection, structured planning, and execution. The planner makes the source field, reasoning rules, and output constraints explicit before answer generation, reducing ambiguity relative to direct prompting. We evaluate SCOPE on 1,500 hybrid reasoning questions over oncology clinical-trial tables against zero-shot, few-shot, chain-of-thought, TableGPT2, Blend-SQL, and EHRAgent. Results show that explicit multi-LLM planning improves accuracy for reasoning-based questions while offering a stronger accuracy-efficiency tradeoff than heavier agentic baselines. Our findings position clinical trial reasoning as a distinct table understanding problem and highlight hybrid planner-based decomposition as an effective solution.
CL Apr 17
FD-NL2SQL: Feedback-Driven Clinical NL2SQL that Improves with Use
Suparno Roy Chowdhury, Tejas Anvekar, Manan Roy Choudhury et al.
Clinicians exploring oncology trial repositories often need ad-hoc, multi-constraint queries over biomarkers, endpoints, interventions, and time, yet writing SQL requires schema expertise. We demo FD-NL2SQL, a feedback-driven clinical NL2SQL assistant for SQLite-based oncology databases. Given a natural-language question, a schema-aware LLM decomposes it into predicate-level sub-questions, retrieves semantically similar expert-verified NL2SQL exemplars via sentence embeddings, and synthesizes executable SQL conditioned on the decomposition, retrieved exemplars, and schema, with post-processing validity checks. To improve with use, FD-NL2SQL incorporates two update signals: (i) clinician edits of generated SQL are approved and added to the exemplar bank; and (ii) lightweight logic-based SQL augmentation applies a single atomic mutation (e.g., operator or column change), retaining variants only if they return non-empty results. A second LLM generates the corresponding natural-language question and predicate decomposition for accepted variants, automatically expanding the exemplar bank without additional annotation. The demo interface exposes decomposition, retrieval, synthesis, and execution results to support interactive refinement and continuous improvement.
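The "single atomic mutation, keep only non-empty results" step described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the `trials` table, its columns, and the helper names `operator_mutations` and `keep_non_empty` are hypothetical, and the mutation here is a plain string swap of one comparison operator rather than a parsed-SQL rewrite.

```python
import sqlite3


def build_demo_db():
    """Create a toy in-memory SQLite database (hypothetical schema, not from the paper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trials (id INTEGER, phase INTEGER, median_os REAL)")
    conn.executemany(
        "INSERT INTO trials VALUES (?, ?, ?)",
        [(1, 2, 14.5), (2, 3, 22.0), (3, 3, 9.8)],
    )
    return conn


def operator_mutations(sql, old_op, new_ops):
    """Yield one variant per replacement operator, changing only the first match of old_op."""
    for op in new_ops:
        yield sql.replace(old_op, op, 1)


def keep_non_empty(conn, candidates):
    """Retain only the candidate queries that execute and return at least one row."""
    kept = []
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # invalid SQL after mutation is simply discarded
        if rows:
            kept.append(sql)
    return kept


conn = build_demo_db()
seed = "SELECT id FROM trials WHERE phase = 3 AND median_os > 20"
variants = list(operator_mutations(seed, " > ", [" < ", " >= ", " = "]))
accepted = keep_non_empty(conn, variants)
```

In this toy data the `<` and `>=` variants return rows and are kept, while the `=` variant returns nothing and is dropped. In the described system, accepted variants would then be passed to a second LLM to generate the matching question and decomposition; that step is omitted here.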
CL Apr 21
EviSearch: A Human in the Loop System for Extracting and Auditing Clinical Evidence for Systematic Reviews
Naman Ahuja, Saniya Mulla, Muhammad Ali Khan et al.
We present EviSearch, a multi-agent extraction system that automates the creation of ontology-aligned clinical evidence tables directly from native trial PDFs while guaranteeing per-cell provenance for audit and human verification. EviSearch pairs a PDF-query agent (which preserves rendered layout and figures) with a retrieval-guided search agent and a reconciliation module that forces page-level verification when agents disagree. The pipeline is designed for high-precision extraction across multimodal evidence sources (text, tables, figures) and for generating reviewer-actionable provenance that clinicians can inspect and correct. On a clinician-curated benchmark of oncology trial papers, EviSearch substantially improves extraction accuracy relative to strong parsed-text baselines while providing comprehensive attribution coverage. By logging reconciler decisions and reviewer edits, the system produces structured preference and supervision signals that bootstrap iterative model improvement. EviSearch is intended to accelerate living systematic review workflows, reduce manual curation burden, and provide a safe, auditable path for integrating LLM-based extraction into evidence synthesis pipelines.
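The reconciliation idea in the abstract above (accept a cell when the two agents agree, force page-level verification when they disagree) can be sketched as follows. The abstract does not specify the module's logic, so everything here is an assumption: the `CellExtraction` record, the `reconcile` function, and the case-insensitive string comparison are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellExtraction:
    """One extracted table cell with page-level provenance (hypothetical record type)."""
    field: str
    value: str
    page: int


def reconcile(pdf_agent_cells, search_agent_cells):
    """Split cells into agreed values and fields that need page-level review.

    Returns (accepted, needs_review), where accepted maps field -> CellExtraction
    and needs_review is a list of (field, pages_to_check) tuples.
    """
    accepted, needs_review = {}, []
    by_field = {c.field: c for c in search_agent_cells}
    for a in pdf_agent_cells:
        b = by_field.get(a.field)
        # Assumption: agreement means equal values after trimming and lowercasing.
        if b is not None and a.value.strip().lower() == b.value.strip().lower():
            accepted[a.field] = a
        else:
            pages = {a.page}
            if b is not None:
                pages.add(b.page)
            needs_review.append((a.field, sorted(pages)))
    return accepted, needs_review


pdf_cells = [
    CellExtraction("median_pfs", "11.2 months", 5),
    CellExtraction("sample_size", "412", 2),
]
search_cells = [
    CellExtraction("median_pfs", "11.2 Months", 5),
    CellExtraction("sample_size", "421", 3),
]
accepted, needs_review = reconcile(pdf_cells, search_cells)
```

With these made-up values, `median_pfs` is accepted and `sample_size` is routed to review with pages 2 and 3 attached, which is the kind of provenance a human reviewer could inspect and correct.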
CL Oct 20, 2025
Evaluating Medical LLMs by Levels of Autonomy: A Survey Moving from Benchmarks to Applications
Xiao Ye, Jacob Dineen, Zhaonan Li et al.
Medical large language models achieve strong scores on standard benchmarks; however, the transfer of those results to safe and reliable performance in clinical workflows remains a challenge. This survey reframes evaluation through a levels-of-autonomy lens (L0-L3), spanning informational tools, information transformation and aggregation, decision support, and supervised agents. We align existing benchmarks and metrics with the actions permitted at each level and their associated risks, making the evaluation targets explicit. This motivates a level-conditioned blueprint for selecting metrics, assembling evidence, and reporting claims, alongside directions that link evaluation to oversight. By centering autonomy, the survey moves the field beyond score-based claims toward credible, risk-aware evidence for real clinical use.