Performance of Large Language Models in Answering Critical Care Medicine Questions
This work addresses the limited evaluation of large language models in specialized medical fields such as Critical Care Medicine, but its contribution is incremental because it applies existing methods to new data.
The study evaluated Meta-Llama 3.1 models (8B and 70B parameters) on 871 Critical Care Medicine questions. The 70B model outperformed the 8B model by 30% and reached an average accuracy of 60%, though per-domain accuracy ranged from 47.9% (Renal) to 68.4% (Research).
Large Language Models (LLMs) have been tested on medical student-level questions, but their performance in specialized fields such as Critical Care Medicine (CCM) is less explored. This study evaluated Meta-Llama 3.1 models (8B and 70B parameters) on 871 CCM questions. Llama3.1:70B outperformed the 8B model by 30% and reached 60% average accuracy. Performance varied across domains, from highest in Research (68.4%) to lowest in Renal (47.9%), highlighting the need for future work to improve models across subspecialty domains.
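The per-domain figures above are accuracies computed over groups of multiple-choice questions. A minimal sketch of that bookkeeping follows, assuming each graded question is stored as a (domain, predicted choice, correct choice) tuple. The function names and the toy records are hypothetical and are not taken from the study. The two gap helpers are included because a statement such as "outperformed by 30%" can be read either as a difference in percentage points or as a relative gain, and the abstract does not say which is meant.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Return {domain: fraction correct} for (domain, predicted, correct) tuples."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for domain, predicted, correct in records:
        totals[domain] += 1
        if predicted == correct:
            hits[domain] += 1
    return {domain: hits[domain] / totals[domain] for domain in totals}


def overall_accuracy(records):
    """Return the fraction correct over all records, ignoring domain."""
    records = list(records)
    return sum(predicted == correct for _, predicted, correct in records) / len(records)


def absolute_gap_points(acc_a, acc_b):
    """Difference between two accuracies, in percentage points."""
    return (acc_a - acc_b) * 100


def relative_gain_percent(acc_a, acc_b):
    """Gain of acc_a over acc_b, as a percentage of acc_b."""
    return (acc_a - acc_b) / acc_b * 100


# Toy records, invented purely for illustration (NOT data from the study).
toy_records = [
    ("Research", "A", "A"),
    ("Research", "B", "B"),
    ("Research", "C", "D"),
    ("Renal", "A", "B"),
    ("Renal", "C", "C"),
    ("Renal", "D", "A"),
]

per_domain = accuracy_by_domain(toy_records)
overall = overall_accuracy(toy_records)

for domain, acc in sorted(per_domain.items()):
    print(f"{domain}: {acc:.1%}")
print(f"Overall: {overall:.1%}")
```

Note that the overall accuracy weights each question equally, so it generally differs from the unweighted mean of the per-domain accuracies when domains contain different numbers of questions.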