CL · Apr 20

Employing General-Purpose and Biomedical Large Language Models with Advanced Prompt Engineering for Pharmacoepidemiologic Study Design

Xinyao Zhang, Nicole Sonne Heckmann, Manuela Del Castillo Suero, Francesco Paolo Speca, Maurizio Sessa

arXiv:2604.17988 · 14.4 · h-index: 10

Predicted impact: top 73% in CL · last 90 days
Originality: Synthesis-oriented

AI Analysis

For pharmacoepidemiologists, this work shows that general-purpose LLMs with advanced prompting currently outperform specialized biomedical LLMs in study design tasks, but all models have limited coding accuracy.

This study evaluated general-purpose LLMs (GPT-4o, DeepSeek-R1) and biomedical LLMs on pharmacoepidemiologic study design using 46 protocols. GPT-4o with Least-to-Most prompting achieved the highest relevance (median score 4 in 8 of 9 questions) and reasoning, while biomedical LLMs underperformed; all LLMs struggled with ontology-code mapping.
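The "median score 4 in 8 of 9 questions" figure is a per-question aggregate. As a minimal sketch of how such a summary could be computed, the snippet below takes the median rating for each question and counts how many questions reach a target median. The function name `questions_at_median`, the question labels, and the toy ratings are all hypothetical illustrations; they are not the study's data or scoring code.

```python
from statistics import median

def questions_at_median(scores_by_question, target):
    """Return per-question medians and the number of questions whose median equals `target`."""
    medians = {q: median(scores) for q, scores in scores_by_question.items()}
    hits = sum(1 for m in medians.values() if m == target)
    return medians, hits

# Hypothetical toy ratings (NOT study data): one list of relevance scores per question.
toy_scores = {
    "Q1": [4, 4, 5],
    "Q2": [3, 4, 4],
    "Q3": [2, 3, 3],
}

medians, hits = questions_at_median(toy_scores, target=4)
print(medians, hits)
```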

Background: The potential of large language models (LLMs) to automate and support pharmacoepidemiologic study design is an emerging area of interest, yet their reliability remains insufficiently characterized. General-purpose LLMs often display inaccuracies, while the comparative performance of specialized biomedical LLMs in this domain remains unknown.

Methods: This study evaluated general-purpose LLMs (GPT-4o and DeepSeek-R1) versus biomedically fine-tuned LLMs (QuantFactory/Bio-Medical-Llama-3-8B-GGUF and Irathernotsay/qwen2-1.5B-medical_qa-Finetune) using 46 protocols (2018-2024) from the HMA-EMA Catalogue and Sentinel System. Performance was assessed on relevance, logic of justification, and ontology-code agreement across multiple coding systems, using Least-to-Most (LTM) and Active Prompting strategies.

Results: GPT-4o and DeepSeek-R1 paired with LTM prompting achieved the highest relevance and logic of justification scores, with GPT-4o-LTM reaching a median relevance score of 4 in 8 of 9 questions for HMA-EMA protocols. Biomedical LLMs showed lower relevance overall and frequently generated insufficient justification. All LLMs demonstrated limited proficiency in ontology-code mapping, although LTM provided the most consistent improvements in reasoning stability.

Conclusion: Off-the-shelf general-purpose LLMs currently offer superior support for pharmacoepidemiologic design compared to biomedical LLMs. Prompt strategy strongly influenced LLM performance.
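Least-to-Most prompting is generally described as a two-stage pattern: first ask the model to break a problem into simpler sub-questions, then answer them in order, feeding each earlier answer back into the next prompt. The sketch below illustrates that general pattern only. The prompt wording, the `least_to_most` function, and the `fake_llm` stand-in are assumptions made for illustration; the abstract does not give the study's actual prompts or pipeline, and a real run would replace `fake_llm` with a call to a model client.

```python
from typing import Callable, List, Tuple

def least_to_most(question: str, llm: Callable[[str], str]) -> Tuple[str, List[Tuple[str, str]]]:
    """Two-stage Least-to-Most sketch: decompose, then solve sub-questions sequentially."""
    # Stage 1: ask the model to decompose the problem (prompt wording is an assumption).
    decomposition = llm(
        "Break the following problem into simpler sub-questions, one per line, "
        "ordered from easiest to hardest.\n\nProblem: " + question
    )
    sub_questions = [line.strip() for line in decomposition.splitlines() if line.strip()]

    # Stage 2: answer each sub-question, then the original question,
    # with all earlier question/answer pairs prepended as context.
    solved: List[Tuple[str, str]] = []
    for sub in sub_questions + [question]:
        context = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in solved)
        answer = llm(f"{context}Q: {sub}\nA:")
        solved.append((sub, answer))

    return solved[-1][1], solved

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call, used only to make the sketch runnable."""
    if prompt.startswith("Break the following problem"):
        return "What is the exposure?\nWhat is the outcome?"
    # Echo how many questions are visible in the prompt, to show context accumulating.
    return f"answer {prompt.count('Q: ')}"

final_answer, steps = least_to_most("Which study design fits this protocol?", fake_llm)
print(final_answer)
for sub_question, answer in steps:
    print(sub_question, "->", answer)
```

Because each prompt carries the earlier question/answer pairs, the final answer is conditioned on the intermediate reasoning. This accumulation is the property usually credited for the technique's effect, though the sketch makes no claim about how the study implemented it.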
