AI, Oct 17, 2022
Review learning: Real world validation of privacy preserving continual learning across medical institutions
Jaesung Yoo, Sunghyuk Choi, Ye Seul Yang et al.
When a deep learning model is trained sequentially on different datasets, it often forgets the knowledge learned from previous data, a problem known as catastrophic forgetting. This degrades the model's performance across diverse datasets, which is critical in privacy-preserving deep learning (PPDL) applications based on transfer learning (TL). To overcome this, we introduce "review learning" (RevL), a low-cost continual learning algorithm for diagnosis prediction using electronic health records (EHR) within a PPDL framework. RevL generates data samples from the model, which are then used to review knowledge from previous datasets. Six simulated institutional experiments and one real-world experiment involving three medical institutions were conducted to validate RevL, using three binary-classification EHR datasets. In the real-world experiment with data from 106,508 patients, the mean global area under the receiver operating characteristic curve (AUROC) was 0.710 for RevL and 0.655 for TL. These results demonstrate RevL's ability to retain previously learned knowledge and its effectiveness in real-world PPDL scenarios. Our work establishes a realistic pipeline for PPDL research based on model transfers across institutions and highlights the practicality of continual learning in real-world medical settings using private EHR data.
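To make the reported metric concrete, the following is a minimal sketch of how an AUROC averaged over several institutions could be computed. The abstract does not define "mean global AUROC", so the averaging scheme here (an unweighted mean of per-institution AUROCs for one model) is an assumption, and the institution names, labels, and scores are made-up toy values rather than data from the paper.

```python
from statistics import mean


def auroc(labels, scores):
    """AUROC as the probability that a random positive case is scored
    higher than a random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy predictions of a single model on three institutions'
# test sets: (binary labels, predicted probabilities).
institutions = {
    "A": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "B": ([0, 1, 1, 0], [0.2, 0.9, 0.6, 0.7]),
    "C": ([1, 0, 1, 0], [0.7, 0.3, 0.9, 0.1]),
}

# Assumed aggregation: unweighted mean of the per-institution AUROCs.
per_site = {name: auroc(y, s) for name, (y, s) in institutions.items()}
mean_auroc = mean(per_site.values())
print(per_site, round(mean_auroc, 3))
```

Under this reading, a model that forgets earlier institutions after sequential training would show a low AUROC on those sites, which pulls the mean down even if the most recent site scores well.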
CL, Mar 3, 2024
KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations
Sunjun Kweon, Byungjin Choi, Gyouk Chu et al.
We present KorMedMCQA, the first Korean Medical Multiple-Choice Question Answering benchmark, derived from professional healthcare licensing examinations conducted in Korea between 2012 and 2024. The dataset contains 7,469 questions from examinations for doctors, nurses, pharmacists, and dentists, covering a wide range of medical disciplines. We evaluate the performance of 59 large language models, spanning proprietary and open-source models, multilingual and Korean-specialized models, and those fine-tuned for clinical applications. Our results show that applying Chain of Thought (CoT) reasoning can improve model performance by up to 4.5% compared to direct answering approaches. We also investigate whether MedQA, one of the most widely used medical benchmarks derived from the U.S. Medical Licensing Examination, can serve as a reliable proxy for evaluating model performance in other regions, in this case Korea. Our correlation analysis between model scores on KorMedMCQA and MedQA reveals that these two benchmarks align no better than benchmarks from entirely different domains (e.g., MedQA and MMLU-Pro). This finding underscores the substantial linguistic and clinical differences between Korean and U.S. medical contexts, reinforcing the need for region-specific medical QA benchmarks. To support ongoing research in Korean healthcare AI, we publicly release KorMedMCQA via Hugging Face.
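The correlation analysis described above can be illustrated with a small sketch. The abstract does not state which correlation coefficient was used, so Spearman rank correlation is an assumption here, and the per-model scores below are invented toy numbers, not results from the paper.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical accuracy (%) of five models on the two benchmarks.
kormedmcqa_scores = [55.0, 62.5, 70.1, 48.3, 81.0]
medqa_scores = [60.2, 58.9, 75.4, 50.0, 79.8]

rho = spearman(kormedmcqa_scores, medqa_scores)
print(round(rho, 3))
```

Running the same computation for a pair of benchmarks from unrelated domains gives a baseline; if the medical pair does not correlate more strongly than that baseline, one benchmark is a weak proxy for the other.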