Nathan Wolfrath

CL
h-index: 5
3 papers
14 citations
Novelty: 28%
AI Score: 32

3 Papers

47.0 DL Mar 15
Rising Prevalence of Detected AI-Generated Text in Medical Literature: Longitudinal Analysis in Open Access Articles

Nathan Wolfrath, Simrin Patel, Madelyn Flitcroft et al.

Generative artificial intelligence (AI) tools are increasingly used for writing tasks. However, the extent of their use in peer-reviewed medical literature remains unclear. We conducted a longitudinal analysis of all Original Investigations, Research Letters, and Invited Commentaries published in JAMA Network Open from January 2022 through March 2025. The main body text of 7,251 articles was analyzed using a commercial AI-detection tool (Originality.AI) to estimate the probability that manuscripts contained a significant amount of AI-generated content. Articles were analyzed in aggregate by month, publication type, and domain. Overall, 195 articles (2.7%) were classified as containing significant AI-generated text. The monthly proportion increased from 0.0% in January 2022 to 11.3% in March 2025, with a significant upward trend over time (P<0.001). Invited Commentaries had the highest proportion of detected AI-generated content (6.7%), followed by Original Investigations (2.2%) and Research Letters (1.4%). There was also significant variation across publication domains (P=0.04). Only 15 articles (0.2%) disclosed large language model use, of which 40.0% were classified as containing AI-generated text. While the findings suggest increasing detectable AI-generated content in medical literature, the limitations of current detection tools necessitate cautious interpretation.

LG Sep 18, 2024
Stronger Baseline Models -- A Key Requirement for Aligning Machine Learning Research with Clinical Utility

Nathan Wolfrath, Joel Wolfrath, Hengrui Hu et al.

Machine Learning (ML) research has increased substantially in recent years due to the success of predictive modeling across diverse application domains. However, well-known barriers exist when attempting to deploy ML models in high-stakes clinical settings, including lack of model transparency (or the inability to audit the inference process), large training data requirements with siloed data sources, and complicated metrics for measuring model utility. In this work, we show empirically that including stronger baseline models in healthcare ML evaluations has important downstream effects that aid practitioners in addressing these challenges. Through a series of case studies, we find that the common practice of omitting baselines or comparing against a weak baseline model (e.g., a linear model with no optimization) obscures the value of ML methods proposed in the research literature. Using these insights, we propose some best practices that will enable practitioners to more effectively study and deploy ML models in clinical settings.

CL Apr 23, 2024
PRISM: Patient Records Interpretation for Semantic Clinical Trial Matching using Large Language Models

Shashi Kant Gupta, Aditya Basu, Mauro Nievas et al.

Clinical trial matching is the task of identifying trials for which patients may be eligible. Typically, this task is labor-intensive and requires detailed verification of patient electronic health records (EHRs) against the stringent inclusion and exclusion criteria of clinical trials. This process is manual, time-intensive, and challenging to scale up, resulting in many patients missing out on potential therapeutic options. Recent advancements in Large Language Models (LLMs) have made automating patient-trial matching possible, as shown in multiple concurrent research studies. However, the current approaches are confined to constrained, often synthetic datasets that do not adequately mirror the complexities encountered in real-world medical data. In this study, we present the first end-to-end, large-scale empirical evaluation of clinical trial matching using real-world EHRs. Our study showcases the capability of LLMs to accurately match patients with appropriate clinical trials. We perform experiments with proprietary LLMs, including GPT-4 and GPT-3.5, as well as our custom fine-tuned model called OncoLLM, and show that OncoLLM, despite its significantly smaller size, not only outperforms GPT-3.5 but also matches the performance of qualified medical doctors. All experiments were carried out on real-world EHRs that include clinical notes and available clinical trials from a single cancer center in the United States.