Jaden Stryker

CL
h-index50
5papers
14citations
Novelty50%
AI Score34

5 Papers

CLMar 13, 2025
It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education

Shrutika Singh, Anton Alyakin, Daniel Alexander Alber et al.

The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities. We hypothesized that LLM performance on medical MCQs may in part be illusory and driven by factors beyond medical content knowledge and reasoning capabilities. To assess this, we created a novel benchmark of free-response questions with paired MCQs (FreeMedQA). Using this benchmark, we evaluated three state-of-the-art LLMs (GPT-4o, GPT-3.5, and LLama-3-70B-instruct) and found an average absolute deterioration of 39.43% in performance on free-response questions relative to multiple-choice (p = 1.3 * 10-5) which was greater than the human performance decline of 22.29%. To isolate the role of the MCQ format on performance, we performed a masking study, iteratively masking out parts of the question stem. At 100% masking, the average LLM multiple-choice performance was 6.70% greater than random chance (p = 0.002) with one LLM (GPT-4o) obtaining an accuracy of 37.34%. Notably, for all LLMs the free-response performance was near zero. Our results highlight the shortcomings in medical MCQ benchmarks for overestimating the capabilities of LLMs in medicine, and, broadly, the potential for improving both human and machine assessments using LLM-evaluated free-response questions.

AIDec 14, 2024
MedG-KRP: Medical Graph Knowledge Representation Probing

Gabriel R. Rosenbaum, Lavender Yao Jiang, Ivaxi Sheth et al.

Large language models (LLMs) have recently emerged as powerful tools, finding many medical applications. LLMs' ability to coalesce vast amounts of information from many sources to generate a response-a process similar to that of a human expert-has led many to see potential in deploying LLMs for clinical use. However, medicine is a setting where accurate reasoning is paramount. Many researchers are questioning the effectiveness of multiple choice question answering (MCQA) benchmarks, frequently used to test LLMs. Researchers and clinicians alike must have complete confidence in LLMs' abilities for them to be deployed in a medical setting. To address this need for understanding, we introduce a knowledge graph (KG)-based method to evaluate the biomedical reasoning abilities of LLMs. Essentially, we map how LLMs link medical concepts in order to better understand how they reason. We test GPT-4, Llama3-70b, and PalmyraMed-70b, a specialized medical model. We enlist a panel of medical students to review a total of 60 LLM-generated graphs and compare these graphs to BIOS, a large biomedical KG. We observe GPT-4 to perform best in our human review but worst in our ground truth comparison; vice-versa with PalmyraMed, the medical model. Our work provides a means of visualizing the medical reasoning pathways of LLMs so they can be implemented in clinical settings safely and effectively.

CLNov 17, 2025
Generalist Foundation Models Are Not Clinical Enough for Hospital Operations

Lavender Y. Jiang, Angelica Chen, Xu Han et al.

Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes that evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and predicting insurance claims denial. In zero-shot settings, both general-purpose and specialized models underperform on four of five tasks (36.6%-71.7% AUROC), with mortality prediction being an exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. We also observed cross-task scaling with joint finetuning on multiple tasks leading to improvement on other tasks. Lang1-1B effectively transfers to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capabilities for hospital operations require explicit supervised finetuning, and that this finetuning process is made more efficient by in-domain pretraining on EHR. Our findings support the emerging view that specialized LLMs can compete with generalist models in specialized tasks, and show that effective healthcare systems AI requires the combination of in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.

AIFeb 26, 2025
CNS-Obsidian: A Neurosurgical Vision-Language Model Built From Scientific Publications

Anton Alyakin, Jaden Stryker, Daniel Alexander Alber et al.

General-purpose vision-language models (VLMs) demonstrate impressive capabilities, but their opaque training on uncurated internet data posse critical limitations for high-stakes decision-making, such as in neurosurgery. We present CNS-Obsidian, a neurosurgical VLM trained on peer-reviewed neurosurgical literature, and demonstrate its clinical utility compared with GPT-4o in a real-world setting. We compiled 23,984 articles from Neurosurgery Publications journals, yielding 78,853 figures and captions. Using GPT-4o and Claude Sonnet-3.5, we converted these image-text pairs into 263,064 training samples across three formats: instruction fine-tuning, multiple-choice questions, and differential diagnosis. We trained CNS-Obsidian, a fine-tune of the 34-billion parameter LLaVA-Next model. In a blinded, randomized deployment trial at NYU Langone Health (Aug 30-Nov 30, 2024), neurosurgeons were assigned to use either CNS-Obsidian or GPT-4o as a diagnostic co-pilot after patient consultations. Primary outcomes were diagnostic helpfulness and accuracy. CNS-Obsidian matched GPT-4o on synthetic questions (76.13% vs 77.54%, p=0.235), but only achieved 46.81% accuracy on human-generated questions versus GPT-4o's 65.70% (p<10-15). In the randomized trial, 70 consultations were evaluated (32 CNS-Obsidian, 38 GPT-4o) from 959 total consults. CNS-Obsidian received positive ratings in 40.62% of cases versus 57.89% for GPT-4o (p=0.230). Both models included correct diagnosis in approximately 60% of cases (59.38% vs 65.79%, p=0.626). Domain-specific VLMs trained on curated scientific literature can approach frontier model performance in specialized medical domains despite being orders of magnitude smaller and less expensive to train. However, low clinical utilization suggests chatbot interfaces may not align with specialist workflows, indicating need for alternative AI integration strategies.

CLOct 11, 2024
MedMobile: A mobile-sized language model with clinical capabilities

Krithik Vishwanath, Jaden Stryker, Anton Alyakin et al.

Language models (LMs) have demonstrated expert-level reasoning and recall abilities in medicine. However, computational costs and privacy concerns are mounting barriers to wide-scale implementation. To address these significant limitations, we introduce a parsimonious adaptation of phi-3-mini, MedMobile, a 3.8 billion parameter LM capable of running on a mobile device, for medical applications. We perform a careful set of pipeline additions and demonstrate that chain of thought, ensembling, and fine-tuning lead to the greatest performance gains, while unexpectedly retrieval augmented generation fails to demonstrate significant improvements. We evaluate the efficiency of our pipeline on the MultiMedQA and MedBullets. We demonstrate that MedMobile scores 75.7% on the MedQA (USMLE), surpassing the passing mark for licensed physicians (~60%) and rivaling scores of models 100 times its size. Across the entirety of the MultiMedQA, MedMobile achieves SOTA performance for models with less than 5B parameters and represents the smallest model to pass the MedQA (USMLE). MedMobile holds promise to democratize access to language models in medicine, bolstering lower compute needs and fast inference speeds. With the ability to combat the biggest barriers to entry for language models in medicine, we hope that MedMobile is a critical step forward in developing clinically relevant language models.